Abstract

As the concept of war has evolved, navigation in urban environments where GPS
may be degraded has become increasingly important. Two existing solutions
are vision-aided navigation and vision-based Simultaneous Localization and Mapping
(SLAM). The problem, however, is that vision-based navigation techniques can require
excessive amounts of memory and increased computational complexity, resulting
in a decrease in speed. This research focuses on techniques to mitigate these issues
by speeding up and optimizing the data association process in vision-based SLAM.
Specifically, this work studies the current methods that algorithms use to associate a
current robot pose with one previously seen and introduces another method to
the image mapping arena for comparison. The current method, k-d trees, is efficient in
lower dimensions but does not narrow the search space enough in higher-dimensional
datasets. In this research, Kernelized Locality-Sensitive Hashing (KLSH) is implemented
to conduct the aforementioned pose associations. Results show that KLSH requires
fewer image comparisons for location identification than other methods.
This work can then be extended into a vision-SLAM implementation
to subsequently produce a map.

KERNELIZED LOCALITY SENSITIVE HASHING FOR FAST IMAGE LANDMARK ASSOCIATION

I.  Introduction 

Current military and commercial operations already implement some level of
autonomy in today's systems [61]. From unmanned aerial vehicles (UAVs) to unmanned
ground vehicles (UGVs), great strides have been made in carrying out missions
remotely. UAVs are used today in military combat missions for intelligence,
reconnaissance and surveillance (ISR) and even close air support and air interdiction,
as demonstrated by the RQ-1 Predator. UGVs, such as the iRobot 510 Packbot,
provide military ground troops with reconnaissance, surveillance, and first responder
capabilities such as explosive and other hazardous material detection and disposal.

Increasing the autonomy in these types of vehicles, as well as others, requires
computer vision [60]. This comes in many forms but overall narrows down to the
ability to use sensors to gather details about the environment and make decisions
based on those details without human interaction. For instance, in object detection,
a sensor may detect an object such as a tree. The robot will use this information
to decide whether it can pass through the obstruction, over it, or around it. The Boston
Dynamics LS3, scheduled for deployment in 2012, is one such robot, having the expected
capability to follow a leader or travel to designated locations with the aid of GPS,
all while carrying large loads over long distances [26]. To obtain and improve on
large-scale implementations such as this, autonomous navigation and computer vision
through the use of images are key [60]. On a much smaller scale, research has been
completed using computer vision to conduct tasks such as vision-aided navigation
(VAN) [56], [68], [79]. One VAN subset in particular, robotic mapping, has made
great strides since it was first developed in the late 1980s [69]. Entire environments can be
mapped with probabilistic approaches using many different types of sensors.

The concept of robotic mapping starts with the localization problem, which
includes position tracking, global localization and solving the kidnapped robot problem
[41], [77]. Localization requires the use of sensors to match details that a robot
currently sees with details that are known about the environment. From there, the
theory progresses to the ability to create a map of an environment without any prior
knowledge. This thesis focuses on techniques to aid the mapping process through the
use of images.

1.1  Overview

Consider an agent in an unknown environment. As it randomly wanders, each
instance is an observation, and each observation an input to an overall map. While
these observations and maps take many forms, which will be discussed throughout
this thesis, the point is that the recognition of observations helps an agent to learn
the boundaries and obstacles of this unknown environment throughout exploration
and to localize itself. Those three concepts sum up the entire environment learning
process: exploration, localization, and mapping [44]. Exploring an environment while
using observations such as odometry to calculate the change in location from the
original starting position is known as dead reckoning. This process has no memory
and therefore cannot recognize repeated terrain. Building a map while localizing
within it, which includes recognizing previously traveled terrain, is known as
Simultaneous Localization and Mapping (SLAM) [25], [58]. However, the problem in
image recognition or, more specifically, re-recognition is that objects in an environment
(and observations in general) are viewed differently depending on lighting, orientation,
and overall position compared to the last time that object was viewed. Therefore, if a
scene in an environment that was captured previously is recognized as new, the map
reflects this as uncharted territory. In order to conduct SLAM, the agent needs to
be able to efficiently match, or re-recognize, observations against a database of observations.
This is known as the data association problem and is discussed in detail throughout
this thesis.

There are many types of observations agents collect, such as vehicle odometry
information, inertial measurements, range information from lasers and sonar, and images.
This thesis focuses on images, inertial measurements and odometry. The images
are used to conduct the data association, while all three can be used to subsequently
produce a map.

1.2  The  Data  Association  Problem 

The data association problem is best explained by comparing human to computer
vision. Given two images of a particular scene, a human identifies details of
what is specifically in the images and how they compare in terms of objects, people,
places, structures, etc. However, computer vision uses features based on contrast,
orientation, scale, etc. Therefore a human may describe an image as being a house
with a car in the driveway, while a computer would describe a particular point on the
edge of the roof at which a drastic contrast change occurs between the roof and the
sky. This description is known as a feature.

In comparing images, a human would see that the same objects in the first image
are seen in the second and therefore are of the same scene. A computer calculates the
spatial distance between similar features in the images. If there are many feature
matches between the images, then there is a high probability that the two images are
of the same scene.

Computer vision methods therefore must compare every feature in an image to
every other feature that the agent has discovered thus far to determine whether the
location is new or previously seen. This, however, requires every feature discovered to
be stored. Feature storage for data association can require large amounts of disk space
and memory, and methods to organize the features can be computationally expensive. These
methods typically take the form of data structures such as trees or hash tables.
Subsequently, the correspondence piece of data association can be computationally
expensive as well, because finding an exact match out of many requires analysis of
every entry in the database.

1.2.1  Exact Match vs k-Nearest Neighbors.  Sometimes, the most computationally
efficient solution, instead of finding an exact match, is to find a few close ones.
These are known as nearest neighbors to the query [52]. The idea is that there should
be a high probability that the exact match is among these neighbors. As alluded to
earlier, matches among features are made by spatial distance calculations. By visualizing
each feature as a $d$-dimensional point in $\mathbb{R}^d$ space, the distance between
two points corresponds to their spatial distance. Common distance metrics in image
mapping are Euclidean and Mahalanobis distance. Given a query feature, $q$, and a set
of database features, $\{p_1, \ldots, p_N\}$, an exact match search would find the point, $p$, such
that the distance between the query and the match is below a threshold and closer
than all other points. This, however, does not guarantee a correct match; therefore,
nearest neighbor searches are implemented to identify the closest $k$ distances to the
query. The best choice of $k$ to ensure the correct match is found is problem dependent.
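
The difference between the two searches can be sketched in a few lines of Python. This is a minimal brute-force illustration, not the implementation used in this thesis; the function names, the sample features and the threshold value are all hypothetical, and Euclidean distance is assumed.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def k_nearest(query, database, k):
    """Return the indices of the k database features closest to the query."""
    order = sorted(range(len(database)),
                   key=lambda i: euclidean(query, database[i]))
    return order[:k]

def exact_match(query, database, threshold):
    """Return the index of the closest feature if it is within the threshold, else None."""
    best = k_nearest(query, database, 1)
    if best and euclidean(query, database[best[0]]) < threshold:
        return best[0]
    return None

# Hypothetical 2-D features; real descriptors have many more dimensions.
features = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (1.2, 0.9)]
q = (1.1, 1.0)
print(k_nearest(q, features, 2))     # [1, 3]
print(exact_match(q, features, 0.5)) # 1
```

Both functions here still scan the whole database; the data structures discussed next aim to avoid that scan.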

1.2.2  k-d trees.  Currently the most common method to conduct data association
for image features is through the use of k-d trees [66], [5], [35], [13]. k-d trees
are a binary tree search method of conducting the data association. The variable $k$ in
this instance, however, refers to the dimensionality of the data rather than the number
of nearest neighbors. These have been shown to produce successful data associations
in databases with hundreds of thousands of image features. The problem with trees,
however, is that they require re-balancing and re-pruning as they grow larger, which
can be computationally expensive. They are built with complexity $O(N)$, in which
$N$ is the time and complexity it takes to complete one operation. Tree maintenance is
then conducted with $O(N \log N)$ [52]. As the number of features continuously grows,
they can also become overcrowded, which adversely affects the data association [66].
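
To make the structure concrete, the following Python sketch builds a small static k-d tree by median splitting and answers a single nearest-neighbor query. It is an illustrative assumption rather than the implementation of any cited work: the dictionary-based node layout and function names are hypothetical, and no re-balancing or insertion is shown.

```python
def build_kdtree(points, depth=0):
    """Recursively build a k-d tree; each node splits on one coordinate axis."""
    if not points:
        return None
    axis = depth % len(points[0])            # cycle through the k dimensions
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2                   # the median point becomes the node
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }

def sq_dist(a, b):
    """Squared Euclidean distance (sufficient for comparisons)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(node, query, best=None):
    """Return the stored point closest to query, pruning branches that cannot win."""
    if node is None:
        return best
    if best is None or sq_dist(query, node["point"]) < sq_dist(query, best):
        best = node["point"]
    diff = query[node["axis"]] - node["point"][node["axis"]]
    near, far = ((node["left"], node["right"]) if diff < 0
                 else (node["right"], node["left"]))
    best = nearest(near, query, best)
    if diff ** 2 < sq_dist(query, best):     # splitting plane is closer than best match
        best = nearest(far, query, best)
    return best

tree = build_kdtree([(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)])
print(nearest(tree, (9, 2)))  # (8, 1)
```

The pruning test is what saves work in low dimensions; in high-dimensional data the splitting plane is rarely farther away than the current best match, so most branches end up being visited.
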

1.2.3  Kernelized Locality-Sensitive Hashing.  Kernelized Locality-Sensitive
Hashing (KLSH) [42] is a hash table-based method of conducting the data association.
An extension of Locality-Sensitive Hashing, KLSH is a robust method that conducts
the nearest neighbor search of a query by associating matching or near-matching
features to the same location in a lookup table. Furthermore, differing from its parent
methods, it calculates all of the hashes in kernel space. This method has been used
previously for nearest neighbor search of images in many dimensions from Internet
datasets such as Flickr as well as others [42]. This research analyzes the robustness
of this method when its use is extended to data association for image mapping
purposes.
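
The bucket idea behind hashing-based search can be sketched as follows. Note that this is plain random-hyperplane Locality-Sensitive Hashing in the input space, shown only to illustrate how similar vectors land in the same table entry; it is not KLSH, which computes its hash functions in kernel space. All function names and parameters are hypothetical.

```python
import random
from collections import defaultdict

def make_hyperplanes(num_bits, dim, seed=0):
    """Draw num_bits random hyperplane normals from a Gaussian distribution."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]

def hash_key(vector, hyperplanes):
    """One bit per hyperplane: which side of the plane the vector falls on."""
    return tuple(int(sum(h * v for h, v in zip(plane, vector)) >= 0.0)
                 for plane in hyperplanes)

def build_table(database, hyperplanes):
    """Map each hash key to the indices of the database vectors stored under it."""
    table = defaultdict(list)
    for index, vector in enumerate(database):
        table[hash_key(vector, hyperplanes)].append(index)
    return table

def candidates(query, table, hyperplanes):
    """Only the vectors in the query's bucket are compared further."""
    return table.get(hash_key(query, hyperplanes), [])
```

A query is then compared only against the indices returned by `candidates`, rather than against the full database, at the cost of occasionally missing a true neighbor that hashed to a different bucket.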

1.3  Goals 

This research focuses on improving the data association techniques needed for
successful mapping in vision-SLAM through the use of KLSH. The goals of this research
are to:

•  Successfully implement KLSH as a data association technique for use in vision-based
SLAM

•  Determine the metric position estimate as a result of the data association output

•  Compare the correspondence accuracy of commonly used SIFT features vs HOG
features

•  Determine the best parameter set for the KLSH algorithm

•  Determine whether the best choice of the number of nearest neighbors is dependent
upon the overall size of the dataset

1.4  Results 

The contributions of this research are the test results, which provide an analysis
of the robustness of using KLSH to accomplish the data association in image mapping
implementations such as vision-based SLAM, and the recommendations of KLSH parameter
settings for use in SLAM. The following are the areas of focus on which
the method is evaluated:

•  Recommend the best choices for the number of neighbors, $k$, required to successfully
associate a query feature to a matching feature in an entire database

•  Evaluate  performance  in  the  context  of  probability  of  accurate  match 

•  Evaluate  performance  in  the  context  of  memory  and  storage  requirements 

1.5  Thesis  Outline 

Chapter II first discusses vision-SLAM algorithms, then details other methods
to conduct each stage of the SLAM process. Then the background information for
this implementation is explained. Chapter III details how the data association implementation
was conducted, while Chapter IV reports implementation test results.
Finally, future work is discussed in Chapter V.

II.  Related Work


While in the past SLAM and other mapping algorithms have been implemented using
a variety of sensor inputs, including range information from lasers [27], [28], [58] and
sonar [44], this thesis explores methods using images. Vision-SLAM is not as widely
used as other methods because of the complexity required in image processing [16].
The memory required to store features and the time required for processing typically
prohibit online capability or require trade-offs in accuracy. Vision-SLAM algorithms
vary in the types of additional sensors used, the number and placement of cameras, and
how the pose estimation is refined to create a map.

Many SLAM implementations use other sensor inputs to calculate the pose estimate,
using the images for verification and weighting. However, research has been done
using vision as the only sensor in conjunction with odometry, such as [9] (only the camera
was used for localization), [22], [35], [40], [48], [67], [78]. Stereo implementations
typically use either epipolar geometry to match feature movement between cameras
or visual geometry to calculate interframe movement [10], [13], [18], [64]. Other implementations
use multiple cameras to obtain a 360° view with omni-directional sensing
[39] or use features from hallway ceilings [36]. As the quality of features
that can be extracted depends on the environment, features extracted from outdoor environments,
open and monotonous areas in particular, are generally sparse and not the strongest
for tracking. Therefore most vision-based SLAM research has been conducted
indoors, but outdoor implementations exist as well [5], [48].
In addition to solving the data association problem, there also exists the loop closing
problem. This is the issue of determining whether a feature is new or previously
seen and therefore needing association. Both [11] and [57], via different methods, implement loop
closure techniques through the SLAM system learning and classifying the appearance
of all the features seen thus far as one subset. The sections that follow discuss common
methods to derive the final map through pose estimation.

2.1  SLAM Methods


2.1.1  Extended Kalman Filter.  The Extended Kalman Filter (EKF), per the name, is an extension
of the Kalman filter that allows for non-linear assumptions to be made in modeling.
As with the similar filters discussed in this section, it begins with the state
vector, $x_t$. Shown below, the 2D state vector consists of the robot pose and map at time
$t$:


$$x_t = (s_t, m_K)^T = (s_{x,t}, s_{y,t}, s_{\theta,t}, m_{1,x}, m_{1,y}, m_{2,x}, m_{2,y}, \ldots, m_{K,x}, m_{K,y})^T \qquad (2.1)$$


The pose, $s_t$, consists of the $x$ and $y$ robot location at each time instance as well
as the orientation, $\theta$. The map, $m_K$, at each time instance consists of the coordinate
locations of all of the features or landmarks observed at the corresponding robot pose,
while $K$ denotes the number of features in the map. Therefore each state vector
has length $2K + 3$. This example denotes the state vector for 2D position
estimation; however, the expression can be extended to higher dimensions.
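
As a small illustration of the $2K + 3$ layout, the Python sketch below stacks a pose and a list of landmark coordinates into one flat state vector. The function name and the numeric values are hypothetical.

```python
def make_state_vector(pose, landmarks):
    """Stack the pose (x, y, theta) and K landmark (x, y) pairs into one flat list."""
    state = list(pose)
    for lx, ly in landmarks:
        state.extend([lx, ly])
    return state

pose = (1.0, 2.0, 0.5)                 # x, y, orientation
landmarks = [(4.0, 4.5), (6.0, 1.0)]   # K = 2 landmarks
x_t = make_state_vector(pose, landmarks)
print(len(x_t))  # 7, i.e. 2K + 3
```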

To calculate and track the pose, the Kalman filter acts as a Bayesian filter
that represents each posterior as a jointly Gaussian distribution given by

$$p(x_t \mid z_t, u_t) = p(s_t, m_K \mid z_t, u_t) \qquad (2.2)$$

assuming $x_t$ is a Gauss-Markov process. That is, the jointly Gaussian representation
of the robot pose and map is parameterized by the mean, $\mu_t$, of dimension $2K + 3$
and the covariance matrix, $\Sigma_t$, of size $(2K + 3) \times (2K + 3)$. $z_t$ represents the
sensor data providing the odometry information used to re-estimate the robot location.
The problem with the Kalman filter lies within its assumptions. The key assumption
is that the motion model, or function that estimates the state vector at time $t$, is
linearly dependent upon the previous pose and map at time $t - 1$ with added Gaussian
noise. For most state transitions and models, this linear relationship does not
hold. Typically, a robot such as the one described in this paper moves in a circular trajectory
[70], resulting in a non-linear pose function. Additionally, some sensors may be
appropriately modeled as non-linear functions of the pose as well. To resolve this, the
model is approximated using the Taylor series expansion [69]. This approximation
is the driving force in the extended Kalman filter, relaxing the linearity assumption.
This allows the motion model of transitions, $x_t$, between time $t - 1$ and $t$, as well as the
measurements, $z_t$, to be characterized as:


$$\begin{aligned} x_t &= f(x_{t-1}, u_t) + w_t \\ z_t &= h(x_t, m_K) + v_t \end{aligned} \qquad (2.3)$$

in which $f$ is the motion model using the previous state, $x_{t-1}$, and input controls, $u_t$,
to estimate the current state, $x_t$, and $h$ is the observation model of each observed
landmark, $m_K$, at pose $x_t$. $w_t$ and $v_t$ represent the zero-mean uncorrelated Gaussian
noise accounting for the unmodeled errors within the motion model and observations,
respectively. The general method for the EKF algorithm, as applied to SLAM, is
shown in Fig. 2.1.

EKF  Algorithm 

1.  Obtain  initial  state  estimate  using  odometry  data 

2.  Observe  and  add  landmarks  to  the  state 

3.  After  moving,  update  the  new  current  state  using  the  odometry  data 

4.  Update  the  estimated  state  from  re-observing  the  landmarks 

5.  Add  new  landmarks  to  the  current  state 

Figure 2.1: EKF Algorithm.

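Each estimate step results in a calculation of the state vector, consisting of the robot
pose and observed map landmarks, as well as a covariance matrix $P$ representing the
following covariances [58]:

$$P_{t|t} = \begin{bmatrix} P_{xx} & P_{xm} \\ P_{xm}^T & P_{mm} \end{bmatrix}_{t|t} \qquad (2.4)$$

in which $P_{xx}$ is the $3 \times 3$ covariance of the pose, $\mathrm{Cov}(s_t, s_t)$, and the off-diagonal
blocks hold the pose-landmark cross-covariances $\mathrm{Cov}(s_t, m_i)$ and $\mathrm{Cov}(m_i, s_t)$.

The update step then yields the following EKF equations as described by [54]:

$$\begin{aligned}
\begin{bmatrix} \hat{x}_{t|t} \\ \hat{m}_t \end{bmatrix} &= \begin{bmatrix} \hat{x}_{t|t-1} \\ \hat{m}_{t-1} \end{bmatrix} + W_t \left[ z_t - h(\hat{x}_{t|t-1}, \hat{m}_{t-1}) \right] \\
P_{t|t} &= P_{t|t-1} - W_t S_t W_t^T \\
S_t &= \nabla h \, P_{t|t-1} \, \nabla h^T + R_t \\
W_t &= P_{t|t-1} \, \nabla h^T S_t^{-1}
\end{aligned} \qquad (2.5)$$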

where $\nabla h$ is the Jacobian of $h$ calculated at $\hat{x}_{t|t-1}$ and $\hat{m}_{t-1}$. $z_t - h(\hat{x}_{t|t-1}, \hat{m}_{t-1})$
is called the innovation and is defined as the difference between the actual
observation and the predicted observation. $S_t$ is then defined as the innovation covariance, while $W_t$ is the
Kalman gain. The Kalman gain is a weighting applied to update calculations that
directly affects how much each observed landmark should be trusted in relation to
the error calculations that are made.
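
The roles of the innovation, the innovation covariance and the Kalman gain are easiest to see in one dimension, where every matrix reduces to a scalar. The sketch below is a simplified illustration under stated assumptions, not the SLAM filter itself: the state is a single number, the observation model defaults to $h(x) = x$ so that $\nabla h = 1$, and the function name and numeric values are hypothetical.

```python
def kalman_update(x_pred, p_pred, z, r, h=lambda x: x, grad_h=1.0):
    """One scalar update step: returns the corrected state and variance."""
    innovation = z - h(x_pred)           # actual minus predicted observation
    s = grad_h * p_pred * grad_h + r     # innovation covariance S_t
    w = p_pred * grad_h / s              # Kalman gain W_t
    x_new = x_pred + w * innovation      # corrected state estimate
    p_new = p_pred - w * s * w           # corrected variance
    return x_new, p_new

x, p = kalman_update(x_pred=10.0, p_pred=4.0, z=12.0, r=4.0)
print(x, p)  # 11.0 2.0
```

With equal prediction and measurement variances the gain is 0.5, so the corrected estimate lands midway between the prediction and the measurement, and the variance is halved.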


2.1.2  Particle Filter.  Particle filters represent a distribution by sampling
from that distribution. They are therefore approximate and nonparametric, and can
represent any distribution, including a Gaussian if needed. First we denote the particles,
which are the samples from the posterior distribution or, more specifically, the state
vectors:

$$\mathcal{X}_t := x_t^{[1]}, x_t^{[2]}, x_t^{[3]}, \ldots, x_t^{[M]} \qquad (2.6)$$


Here $M$ is the number of particles in the set $\mathcal{X}_t$. Typically this number is large (e.g.,
tens of thousands). The likelihood that a particular state hypothesis, $x_t$, is among
the particle set can be found using its Bayes filter relationship:

$$x_t^{[m]} \sim p(x_t \mid z_{1:t}, u_{1:t}) \qquad (2.7)$$

[70] explains that this relationship remains true theoretically as $M \to \infty$, but $M >
100$ is typically suitable in practice. At each time $t$, a new sample hypothesis is
generated for each particle from the distribution $x_t^{[m]} \sim p(x_t \mid u_t, x_{t-1}^{[m]})$, based
on the previous particle as well as the control used, $u_t$.

Next the importance factor, or particle weight, is found:

$$w_t^{[m]} = p(z_t \mid x_t^{[m]}) \qquad (2.8)$$

By examining this likelihood, the weight represents the possibility that the $m$-th particle
is the correct representation of the environment considering the observed measurements,
$z_t$, and the known statistics, $x_t^{[m]}$. Next the particles are resampled. A
new temporary particle set, $\bar{\mathcal{X}}_t$, is formed by randomly drawing with replacement $M$
particles. The probability of drawing a particle is a function of its weight, $w_t^{[m]}$.

By drawing $M$ random uniformly distributed numbers on the interval $[0, 1]$ and
selecting the particle that corresponds to the weight range of each random number,
a new particle set is formed. For each $m$, the old particle is replaced with
the drawn particle and reassigned a weight of $1/M$. Ideally, since this step is conducted
with replacement, the particles with higher importance or stronger weights tend to
be chosen more often, while the particles of less importance are eliminated.
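
The resampling step can be sketched in Python as below. This is a minimal illustration, assuming the weights are normalized inside the function and that every resampled particle is reset to the uniform weight $1/M$, which is the usual convention; the function name and its arguments are hypothetical.

```python
import bisect
import itertools
import random

def resample(particles, weights, rng=random):
    """Draw len(particles) particles with replacement, proportional to weight."""
    total = sum(weights)
    # Cumulative normalized weights define each particle's range on [0, 1).
    cumulative = list(itertools.accumulate(w / total for w in weights))
    cumulative[-1] = 1.0                       # guard against rounding error
    new_particles = []
    for _ in particles:
        u = rng.random()                       # uniform draw on [0, 1)
        index = bisect.bisect_right(cumulative, u)
        new_particles.append(particles[index])
    uniform = 1.0 / len(particles)
    return new_particles, [uniform] * len(particles)
```

A particle with zero weight owns an empty range and is never drawn, while a particle with a large weight is likely to appear several times in the new set.
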

2.1.3  Rao-Blackwellized Particle Filter.  The Rao-Blackwellized Particle
Filter (RBPF) is a Monte Carlo filtering technique. The drawback of the particle filter
is that in high-dimensional spaces, sampling can be inefficient. Rao-Blackwellisation
is the recognition that part of the model is tractable and can be analytically marginalized out,
so that only the necessary segments of the data must be represented by samples. The
advantage follows directly: with this strategy, the size of the space requiring sampling is
greatly reduced.

As with ordinary particle filters, the posterior over the state space is written as

$$p(y_{0:t} \mid z_{1:t}) = p(y_{0:t-1} \mid z_{1:t-1}) \, \frac{p(z_t \mid y_t)\, p(y_t \mid y_{t-1})}{p(z_t \mid z_{1:t-1})} \qquad (2.9)$$


where $y_t$ and $z_t$ are the state vectors and observations, respectively. The fundamental
piece in the Rao-Blackwellisation is that the state space is divided into two groups,
$r_t$ and $x_t$, resulting in

$$p(y_t \mid y_{t-1}) = p(x_t \mid r_{t-1:t}, x_{t-1})\, p(r_t \mid r_{t-1}). \qquad (2.10)$$

This allows $x_{0:t}$ to be marginalized out of the conditional posterior distribution as
shown in Equation (2.11)

$$p(r_{0:t}, x_{0:t} \mid z_{1:t}) = p(x_{0:t} \mid z_{1:t}, r_{0:t})\, p(r_{0:t} \mid z_{1:t}), \qquad (2.11)$$

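so that only $p(r_{0:t} \mid z_{1:t})$, which covers just a fraction of the entire sample space,
needs to be sampled. The importance sampling then becomes

$$\hat{r}_t^{(i)} \sim q(r_t \mid r_{0:t-1}^{(i)}, z_{1:t}) \qquad (2.12)$$

with weights

$$w_t^{(i)} = \frac{p(\hat{r}_{0:t}^{(i)} \mid z_{1:t})}{q(\hat{r}_t^{(i)} \mid r_{0:t-1}^{(i)}, z_{1:t})\, p(r_{0:t-1}^{(i)} \mid z_{1:t-1})} \qquad (2.13)$$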

and normalized by

$$\tilde{w}_t^{(i)} = w_t^{(i)} \left[ \sum_{j=1}^{N} w_t^{(j)} \right]^{-1} \qquad (2.14)$$


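In the selection step, the samples $\hat{r}_{0:t}^{(i)}$ with large importance weights are multiplied and
those with small weights are suppressed to obtain $N$ random samples, $\tilde{r}_{0:t}^{(i)}$. Their distribution is
approximately $p(\tilde{r}_{0:t}^{(i)} \mid z_{1:t})$. Finally, a Markov transition kernel with invariant
distribution $p(r_{0:t}^{(i)} \mid z_{1:t})$ is applied to obtain $r_{0:t}^{(i)}$.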

2.1.4  FastSLAM.  FastSLAM differs from other SLAM algorithms in that it
relies on the fact that the posterior can be factored [53]:

$$p(s^t, \theta \mid z^t, u^t, n^t) = p(s^t \mid z^t, u^t, n^t) \prod_{n} p(\theta_n \mid s^t, z^t, u^t, n^t) \qquad (2.15)$$

It samples the path using a particle filter in which each particle, $m$, carries its own
version of the map. Each particle also contains $N$ extended Kalman filters, one per
landmark, with the following EKF parameters:

$$\mu_{n,t}^{[m]}, \; \Sigma_{n,t}^{[m]} \qquad (2.16)$$


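The original FastSLAM [50] makes the pose and observation estimates independent
of each other, sampling the pose from a proposal distribution based solely on
the most recent motion command and incorporating the observations through resampling. In situations
of high motion noise, this leads to the sampling of unlikely poses. FastSLAM 2.0 [51]
samples the pose as a function of both the control and the measurements, yielding

$$s_t^{[m]} \sim p(s_t \mid s^{t-1,[m]}, u^t, z^t, n^t) \qquad (2.17)$$

This formulates the factorable proposal distribution as

$$p(s_t \mid s^{t-1,[m]}, u^t, z^t, n^t) \propto \int p(z_t \mid \theta_{n_t}, n_t, s_t)\, p(\theta_{n_t} \mid s^{t-1,[m]}, z^{t-1}, n^{t-1})\, d\theta_{n_t} \; p(s_t \mid s_{t-1}^{[m]}, u_t) \qquad (2.18)$$

Then resampling is conducted with

$$w_t^{[m]} \propto p(z_t \mid s^{t-1,[m]}, u^t, z^{t-1}, n^t) = \iint p(z_t \mid \theta_{n_t}, s_t, n_t)\, p(\theta_{n_t} \mid s^{t-1,[m]}, z^{t-1}, n^{t-1})\, p(s_t \mid s_{t-1}^{[m]}, u_t)\, d\theta_{n_t}\, ds_t. \qquad (2.19)$$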

Finally, feature tracking is made reliable by using both the presence and the absence of features
as evidence. Originally from [24], this algorithm is modified to
maintain the log-odds of the physical existence of each landmark.


2.2  Feature  Extraction 

No matter what mapping filter method is used, vision SLAM and the image
mapping arena as a whole depend on the ability to retrieve mathematically interesting
keypoints in an image. More importantly, those same keypoints need to be
recognized in similar images. When a person looks at an image, key features would
be distinct objects, buildings or interesting scenic points that would strike memory
upon recognition in another image. However, in computer vision features are typically
detected by distinctive lighting, intensity, and orientation changes as well as expected
geometric and photometric deformations throughout the image. Furthermore, features
that are found in an image need to be invariant against these as well as translation,
rotation, and scale changes. How much invariance is required, however, is dependent
upon the problem at hand [6]. Many researchers have found strong techniques that
have shown invariance in many aspects of images and environments, both indoor and
outdoor. Well-known techniques include finding dissimilarity and texture [65],
using Hessian matrices [6], multi-scale phase-based features [14], as well as one evaluated
later in this thesis called the Scale Invariant Feature Transform (SIFT) [46].
SIFT has been very successful in the computer vision community and is frequently used
in vision-aided navigation [75] and image mapping [5], [9], [22], [18].

2.2.1  Interest Point Detectors.  In order to help find these interesting points
in an image, interest point detectors have been implemented. These help to narrow
down the image space in which feature detection methods are used. Harris Corners
[33], developed in 1988, are a combination of scale-variant corner and edge detectors.
This method provided a basis for feature extraction and is therefore followed by many
variations. Region detectors such as the affine and scale invariant salient regions found
by [37], [38] are also popular. Finally, blobs, or regions lighter than their background
defined by smooth curved edges, have been used as well [21].

2.3  Data  Storage  &  The  Data  Association  Problem 

The data association problem is probably the most crucial part of image mapping.
A wrong association will cause filter methods to diverge and maps to fall apart.
The problem is that direct search methods such as linear search can be computationally
expensive. And while distance metrics identify matches well, determining outliers
or mismatches can be even more expensive. These are the instances in which it is
better to find a few database objects that are close in distance to a query rather than
an exact match. These nearest neighbors have a high probability of containing the match. Given a
query feature, $q$, and a set of database features, $\{p_1, \ldots, p_N\}$, an exact match would be
such that the distance between the query and the match is below a set threshold and
closer than all other points. Finding the $k$-nearest neighbors entails identifying the
closest $k$ distances to the query. In covering common data association methods, this
section reviews two common types of data structures used to store data for improved
search techniques: trees and hash tables.

2.3.1  Data Association in SLAM.  While the FastSLAM and FastSLAM 2.0
implementations both used maximum likelihood [49], [51] to solve the data association
problem, the most common method has been to store and match image features using
k-d trees. Both [57] and [67] use k-d trees with SIFT features in conjunction with
SLAM via particle filtering, while [35] used them for 3D SLAM but added a Best
Bin First heuristic to the k-d tree search. Finally, [18] used vision-SLAM to map an
environment through visual odometry by associating SIFT features using hash tables.

2.3.2 Linear Search. The most basic method of solving the data association
problem is linear search. In the context of matching images, this is conducted by
comparing each query feature to every feature in the database, typically by Euclidean
or Mahalanobis distance. While this may be sufficient for small datasets in
low dimensionality, the operation time and complexity grow as O(Mn), where
M is the time and complexity to search one feature and n is the number of features in
the database.
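
As an illustration only, linear search can be sketched as follows. The function names, the toy 2D features, and the threshold value are assumptions made for this example and are not taken from the implementation described in this work.

```python
import math

def euclidean(a, b):
    """Euclidean (l_2) distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def linear_search(query, database, threshold):
    """Compare the query against every database feature.

    Returns (index, distance) of the closest feature, or None when even the
    closest feature is farther away than `threshold` (hypothetical parameter).
    """
    best_index, best_dist = None, float("inf")
    for i, feature in enumerate(database):
        d = euclidean(query, feature)
        if d < best_dist:
            best_index, best_dist = i, d
    if best_dist > threshold:
        return None
    return best_index, best_dist

# Toy data: three 2D "features" and one query.
database = [(0.0, 0.0), (1.0, 1.0), (4.0, 5.0)]
match = linear_search((0.9, 1.2), database, threshold=0.5)
no_match = linear_search((10.0, 10.0), database, threshold=0.5)
```

Every query touches all n database features, which is the cost that the tree and hash table structures reviewed in this section try to avoid.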

At times the most efficient option is not necessarily to find an exact match to a
specific query. Sometimes, finding a few matches similar to the exact one is best. Nearest
neighbor matching algorithms narrow down a large dataset to a more manageable
space. An exact search can then be conducted on those few neighbors.

2.3.3 Nearest Neighbor Search Methods. The nearest neighbor search problem
states that given a set V of points in a d-dimensional space, a data structure can be
constructed which, given a query point q, will find the k points in V with the smallest
distance to q [52]. Here k is the number of nearest neighbors desired. As alluded to
previously, distance is user defined, but typically an l_s norm is used in which

\| p - q \|_s = \left( \sum_{i=1}^{d} | p_i - q_i |^s \right)^{1/s}    (2.20)

There  are  three  main  types  of  nearest  neighbor  (NN)  output  methods: 

• Range Search

• k-NN

• Approximate k-NN (ANN)
All of these methods are based on the spatial distance from a query, q, to a potential
matching point, p. In range search, a maximum range is established in searching for a
match to a query [52]. The range is the threshold that database objects must meet in
order to be matches or neighbors. There is no limit on the number of neighbors returned,
and there are typically no heuristics to decrease the range as closer points are found.
The k-nearest neighbor search simply finds the closest k points to the query. In
approximate k-NN search, however, there is a termination point in addition to finding the
k-th neighbor: once the distance between a point, p, and q surpasses a threshold or range,
no more neighbors are returned, regardless of how far the count is from k.
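
The three output methods can be contrasted with a brute-force sketch. This is only an illustration: the function names and toy points are assumptions, and `bounded_knn_search` is one possible reading of the termination rule described above, not a reference implementation of any particular ANN algorithm.

```python
import heapq
import math

def dist(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def range_search(query, points, radius):
    """Return every point within `radius` of the query (no cap on the count)."""
    return [p for p in points if dist(query, p) <= radius]

def knn_search(query, points, k):
    """Return the k points closest to the query, nearest first."""
    return heapq.nsmallest(k, points, key=lambda p: dist(query, p))

def bounded_knn_search(query, points, k, radius):
    """Return at most k neighbors, dropping any that lie beyond `radius`."""
    return [p for p in knn_search(query, points, k) if dist(query, p) <= radius]

points = [(0, 0), (1, 0), (0, 2), (3, 3), (5, 5)]
q = (0, 0.5)
in_range = range_search(q, points, radius=1.6)
nearest_two = knn_search(q, points, k=2)
bounded = bounded_knn_search(q, points, k=4, radius=1.6)   # fewer than k returned
```

In this toy case the bounded search asks for four neighbors but returns three, because the fourth closest point lies outside the range.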

2.3.4 kd-Trees. The kd-tree is a binary tree data structure used to store a
finite set of points from a k-dimensional space [52]. Given a set of points in this
space, each can be placed in the binary tree according to its coordinates. Using search
methods, the tree can then be pruned down to a single match or a neighborhood of
matches for a query q.

2.3.4.1 Tree Creation. Given an N x k_d data set, in which N is the
number of samples and k_d is the number of dimensions in the space, a kd-tree
can be constructed using statistics of the data as well as the k_d-dimensional coordinates,
or nodes, of each sample. The leaves of the tree are nodes or sets of nodes, while the
branches are splits made based on the statistics of node locations. Trees are built
with complexity O(N), in which N is the time and complexity of completing one operation.

First, the variance of the points across each dimension is analyzed. The dimension,
i, with the greatest variance forms the first branch split. This split is in the form
of a hyperplane and is made at the median, m, of the data in that dimension. The
first step is then repeated within each resulting partition; that is, the split is
continuously made at the median of the dimension with the greatest variance [7], [8], [47].
This ensures balance in the tree with depth d ≤ log_2 N, where N is the total number of
data points across all samples, or in this case feature vectors.
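
The construction rule just described (split at the median of the dimension with the greatest variance) can be sketched as below. The `Node` class, the helper names, and the eight toy points are assumptions for illustration; the points were chosen so that the first split falls on dimension 1 at 0.34, but they are not the points of Fig. 2.2.

```python
from statistics import pvariance

class Node:
    """Internal node: split dimension and median value. Leaf: a single point."""
    def __init__(self, point=None, dim=None, median=None, left=None, right=None):
        self.point, self.dim, self.median = point, dim, median
        self.left, self.right = left, right

def build_kdtree(points):
    """Recursively split on the highest-variance dimension at its median."""
    if len(points) == 1:
        return Node(point=points[0])
    k = len(points[0])
    # Pick the dimension with the greatest variance among these points.
    dim = max(range(k), key=lambda i: pvariance([p[i] for p in points]))
    ordered = sorted(points, key=lambda p: p[dim])
    mid = len(ordered) // 2
    median = (ordered[mid - 1][dim] + ordered[mid][dim]) / 2
    return Node(dim=dim, median=median,
                left=build_kdtree(ordered[:mid]),
                right=build_kdtree(ordered[mid:]))

def depth(node):
    """Number of splits on the longest path from this node to a leaf."""
    if node.point is not None:
        return 0
    return 1 + max(depth(node.left), depth(node.right))

points = [(0.1, 0.5), (0.2, 0.7), (0.3, 0.3), (0.32, 0.6),
          (0.36, 0.4), (0.6, 0.65), (0.8, 0.35), (0.9, 0.55)]
tree = build_kdtree(points)   # root splits dimension index 0 near 0.34
```

Sorting at every level keeps the sketch short; an implementation concerned with build time would use a linear-time median selection instead.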
Consider the 2D example in the left of Fig. 2.2, which forms the tree node layout
shown on the right. This becomes increasingly hard to visualize as the number of
dimensions increases (for example, the SIFT feature vector discussed later has 128
dimensions); however, the theory remains the same.

The binary tree shown in the right of Fig. 2.2 is then built from the
splits. Each node in the tree represents a line that bounds a hyperrectangle in the data
space. The first node represents the split made at 0.34 on dimension 1. The left
and right branches then represent the points in the space to the left and right of the
split, respectively. As before, the first split on the left side occurs at the median
of dimension 2, as this is the dimension with the greatest variance across the points
in that hyperrectangle. This process is repeated until the branches lead to the leaves
of the tree, which are the data points themselves. In this tree, each leaf represents a
single data point; however, some trees end in small clusters of similar data points.
In the latter case, the siblings in the leaf are the nearest neighbors of the
query. Observing the tree, each node is represented by the (i, m) values
denoting the dimension and the value at which the split was made.

2.3.4.2 Nearest Neighbor and Best Bin First Search. Nearest neighbor
algorithms find the data point closest in Euclidean distance to the
query point, or feature vector to be matched. First the tree is traversed in order to
find the bin that the query point fits in according to the current tree structure. During
that traversal, the one-dimensional distances to the branching points are also calculated
and recorded in a priority search queue. Once the bin is found, the leaf in that bin can
be tested for matching as a good approximation to the nearest neighbor. The distance,
D, from q to the data point in that bin is calculated. This bin is considered the best
bin. All subsequent searching is completed starting with the best bin first (BBF).
The tree is then backtracked, pruning off branches whose one-dimensional distances
from q are greater than D. This procedure ensures efficient detection of the
matching data point [52].
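
A compact sketch of this search is given below. It is a simplified illustration rather than the implementation of [52]: the tree is stored as nested tuples, the split dimension simply cycles with depth instead of following the variance rule, and `max_leaves` is a hypothetical parameter that caps how many bins are examined.

```python
import heapq
import math

def build(points, depth=0):
    """Build a small kd-tree as nested tuples; each leaf holds one point."""
    if len(points) == 1:
        return ("leaf", points[0])
    dim = depth % len(points[0])            # cycle dimensions for brevity
    ordered = sorted(points, key=lambda p: p[dim])
    mid = len(ordered) // 2
    split = (ordered[mid - 1][dim] + ordered[mid][dim]) / 2
    return ("node", dim, split,
            build(ordered[:mid], depth + 1), build(ordered[mid:], depth + 1))

def bbf_nearest(tree, q, max_leaves=None):
    """Best-bin-first nearest neighbor search.

    Unexplored branches wait in a priority queue keyed by the one-dimensional
    distance from q to their splitting plane. Branches whose plane distance is
    not smaller than the best distance D found so far are pruned.
    """
    best, best_d = None, float("inf")
    queue = [(0.0, 0, tree)]                # (plane distance, tie-breaker, subtree)
    counter, leaves = 1, 0
    while queue:
        plane_d, _, node = heapq.heappop(queue)
        if plane_d >= best_d:
            break                           # all remaining branches are farther
        while node[0] == "node":            # descend to the bin containing q
            _, dim, split, left, right = node
            near, far = (left, right) if q[dim] <= split else (right, left)
            heapq.heappush(queue, (abs(q[dim] - split), counter, far))
            counter += 1
            node = near
        d = math.dist(q, node[1])
        leaves += 1
        if d < best_d:
            best, best_d = node[1], d
        if max_leaves is not None and leaves >= max_leaves:
            break                           # approximate answer: stop early
    return best, best_d

points = [(0.1, 0.5), (0.2, 0.7), (0.3, 0.3), (0.32, 0.6),
          (0.36, 0.4), (0.6, 0.65), (0.8, 0.35), (0.9, 0.55)]
tree = build(points)
nearest, distance = bbf_nearest(tree, (0.33, 0.62))
```

With `max_leaves=None` the search is exact; setting a small cap trades accuracy for speed, which is the usual reason for using BBF on high-dimensional descriptors.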
Figure 2.2: kd-tree example. Each of the points in space is shown
in this k = 2 kd-tree data space illustration (left). The division lines are
made based on the statistics of the points in that section. For instance, the
first line dividing A-D and E-H was made along dimension 1 because
the variance of the points across dimension 1 was greater than that of
dimension 2. The line is drawn at the median of the points in dimension
1. This process is repeated until only individual points are left. The binary kd-tree
branch diagram (right) can then be formed based on which side of each
division line the points are on.

2.3.4.3 Tree Size. While most literature fails to comment on the size
of the overall tree and the number of data points or leaves that should appear at the
branch ends, the understanding is that the tree should be large enough that the data
space is accurately spread throughout, allowing for the smallest search space possible
when comparing query and data feature vectors. In other words, trial and error may be
the best option.
2.3.4.4 Size of k. This concept has mainly been tested on mid-dimensional
datasets of 8-20 dimensions [7], [8], [52]. Typically, the performance
decreased as the dimensionality increased. Adding heuristics such as those noted
above allowed for use in much higher dimensional data spaces. Similar to the intention
of this work, [47] applied tree matching techniques to 128-dimension SIFT feature
vectors with good results. The purpose of their work was to experiment with
the NN algorithm using dimensionality reduction on SIFT vectors, matching features
from varying viewpoints and scale changes. While they had varying results with the
dimensionality reduction, they had accurate matching using the full-length vector.
Essentially, the rule of thumb to follow is k ≪ N.

2.3.4.5 Tree Maintenance. Adding or deleting points from the tree
can be done at any time with complexity O(N log N); however, periodic balance checks
should be performed to ensure that the tree is still balanced. The tree
shown in Fig. 2.2 is a depiction of a fully balanced tree. If nodes are deleted from
one side of the tree more than the other, imbalance occurs. The degree of imbalance
required to cause issues in the algorithm is dependent on the problem, but overall
this balance is key when performing nearest neighbor-like algorithms.

2.3.4.6 Nearest Match vs. a Close One. As remarked in [67], the disadvantage
of using a kd-tree is that sometimes the match produced is not the nearest but
a close one. While some algorithms settle for this, others use heuristics to help ensure
nearest match results. One such heuristic searches through a predetermined number
of branches to help ensure that the search was not stopped prematurely. In another
implementation, [52] built multiple kd-trees, producing different branch structures
while projecting points onto different hyperplanes to ensure the search converged on
the right data point.

2.3.5 LSH. Locality-Sensitive Hashing (LSH) was originally developed by
[34], then refined by [31]. By hashing points to a series of hash tables, the algorithm
ensures a higher probability of collision for objects that are close to each other in
distance than for those that are farther apart [31]. This concept and subsequent
variations ([1], [3], [15]) have been used for similarity search on several sets of data,
including high-dimensional image data. A new hashing family was then introduced by [2]
in which the distances that defined the family were measured according to the l_s norm.
Similar to the intent of this thesis, LSH was successfully used in agent localization for
image mapping in [59].
An LSH algorithm is defined by the number of nearest neighbors output or how
near a point should be to be considered. The first definition below describes the
cR-near neighbor formulation. See Fig. 2.3 for an illustration. This is similar to the
k-NN and ANN searches discussed previously.

LSH Definition 1  Consider an N x d dataset V in which N is the
number of points, p, and d is the number of dimensions in space,
with parameters radius R > 0 and probability δ > 0. A data structure
can be constructed which, given a query point q, does the following
with probability 1 - δ: in the event there is an R-near neighbor of q in
V, the cR-near neighbors of q in V are reported.


Figure 2.3: cR-near neighbors illustration from Definition 1.

LSH algorithms use more than one hash function to store data; these are known
as hash function families, $\mathcal{H}$. Increasing the number of hash functions makes
the probability of a collision between p and q much higher for points that are close
and lower for points that are not. This probability, as seen in Equation (2.21) below, is
related to the distance between those points.
LSH Definition 2  A family of hash functions, $\mathcal{H}$, is called
(R, cR, P_1, P_2)-sensitive if for any points p, q in the d-dimensional space:

\begin{cases}
\| p - q \| \le R   & \Rightarrow \; \Pr_{\mathcal{H}}[h(q) = h(p)] \ge P_1 \\
\| p - q \| \ge cR  & \Rightarrow \; \Pr_{\mathcal{H}}[h(q) = h(p)] \le P_2
\end{cases}    (2.21)

P_1 and P_2 in this instance are probability thresholds that satisfy P_1 > P_2 (with
R < cR). In other words, every hash function in the family must report hash results
such that when the distance between p and q is within R, there is a high probability
(at least P_1) of collision. Additionally, when the distance is greater than cR, there is
a low probability of collision (at most P_2).

As mentioned earlier, there are a few different methods to form the LSH hash tables
to solve the nearest neighbor problem. This discussion is based on the technique
developed by [31]. This technique transforms all points into the Hamming cube, H^{d'},
to compute hashes. See Appendix A.1 for a review of Hamming cube representation
via the Unary function.

First, point p is represented as a binary vector, v(p), with length d', where
d' = Cd and C is the largest coordinate value across all dimensions in p. Next, denote I
as a subset of the d' dimensions, I ⊆ {1, ..., d'}. There are t hash functions in
each bucket, and b buckets represent the hash sequence for each point. One hash
bucket is defined as:


g_j(v(p)) = v(p)|_{I_j}    (2.22)

where v(p)|_{I_j} is the projection of v(p) on the coordinate set I_j. The following example
illustrates the computation of v(p)|_I in 2 dimensions. Consider the point from
Equation (A.2) in the example in Appendix A.3, v(p) = [1110011111], where d' = 10,
and set t = 3. At random, select t coordinates from the set {1, ..., d'} to project onto,
for example I = {2, 5, 8}.
Figure 2.4: LSH random projection example.

Fig. 2.4 shows that the projection of the point onto the coordinate set in Hamming
space is the concatenation of the bits in those positions. Thus, g_j(p) = [101].
Each bit selection in g_j(p) is a different hash function, h, and together they form an
overall hash bucket. Each bucket represents the hashing of a point to a table. This
process is repeated b times such that


g(p) = (g_1(p), ..., g_b(p))    (2.23)

These b concatenations are the indices into the b hash tables in which point p is
stored. Query processing simply uses the same functions to hash query points. The
act of retrieving a database point from the indices calculated is considered a collision.
Points are retrieved until all b buckets have been retrieved or the total number of points
retrieved exceeds 2b. Finding an exact match to q among all of the neighbors gathered
requires using any of the nearest neighbor searches mentioned in Section 2.3.3. Fig. 2.5
outlines the general algorithm.
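
The bit-sampling scheme described above can be sketched as follows. This is a minimal illustration, not the implementation of [31]: the class and function names are assumptions, and the point (3, 5) with C = 5 is inferred from the bit string 1110011111 under the usual unary code (x ones followed by C - x zeros), since Appendix A is not reproduced in this section.

```python
import random
from collections import defaultdict

def unary_embed(point, C):
    """Map integer coordinates in [0, C] to a bit list of length C * d."""
    bits = []
    for x in point:
        bits.extend([1] * x + [0] * (C - x))
    return bits

def project(bits, coords):
    """g_j: concatenate the bits at the 1-indexed coordinates in I_j."""
    return tuple(bits[i - 1] for i in coords)

class BitSamplingLSH:
    """b hash tables, each keyed by t randomly chosen bits of v(p)."""

    def __init__(self, d, C, t, b, seed=0):
        rng = random.Random(seed)
        self.C = C
        # One random coordinate set I_j of size t per table.
        self.coord_sets = [sorted(rng.sample(range(1, C * d + 1), t))
                           for _ in range(b)]
        self.tables = [defaultdict(list) for _ in range(b)]

    def insert(self, point):
        bits = unary_embed(point, self.C)
        for coords, table in zip(self.coord_sets, self.tables):
            table[project(bits, coords)].append(point)

    def query(self, q):
        """Collect colliding points, stopping once more than 2b are retrieved."""
        bits = unary_embed(q, self.C)
        limit = 2 * len(self.tables)
        found = []
        for coords, table in zip(self.coord_sets, self.tables):
            for p in table.get(project(bits, coords), []):
                if p not in found:
                    found.append(p)
                if len(found) > limit:
                    return found
        return found

# Worked example from the text: v(p) = 1110011111 projected onto I = {2, 5, 8}.
v = unary_embed((3, 5), C=5)
g = project(v, [2, 5, 8])

index = BitSamplingLSH(d=2, C=5, t=3, b=4)
for p in [(3, 5), (3, 4), (0, 0), (5, 1)]:
    index.insert(p)
candidates = index.query((3, 5))
```

The candidates returned by `query` are only neighbors with high probability; an exact comparison over that short list is still needed to answer the query.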

LSH Algorithm

1.  Preprocessing

(a)  Map all points to Hamming space.

(b)  Choose t hash functions from $\mathcal{H}$ to form a bucket g_j(p).

(c)  Select b hash function buckets (g_1(p), ..., g_j(p), ..., g_b(p)).

(d)  For each bucket j, compute the projection of vector v(p) onto the t dimensions
in the coordinate set I_j.

i.  This forms the tuple g_j(p) = (h_1(p), ..., h_t(p)), for 1 ≤ j ≤ b.

2.  Processing and Matching a query, q

(a)  Hash each query point using the same b hash function buckets as in
preprocessing.

(b)  Retrieve all points in the corresponding buckets g_1(q), ..., g_b(q) until either
of the following occurs:

i.  All points from the b buckets have been retrieved.

ii.  The total number of points retrieved exceeds 2b.

(c)  Use nearest neighbor search techniques to answer the query.


Figure 2.5: LSH Algorithm.

2.4 Exact Match Search via Line Fitting

With the exception of linear search, the previous section examined data association
methods to find a number of points similar to a given query, with high probability
that the exact match is included. To then find an exact match, this section reviews
line fitting techniques to conduct image registration. As image registration deals with
aligning images with translation, rotation, and scale changes, line fitting between
many feature matches between images in the presence of outliers is key to ensuring
the feature match locations in the images are the same [80].

2.4.1 Least Squares Estimation. Given a set of data points in X and Y,
the most common technique for finding the line or model that best represents or fits
the data is the least squares estimation (LSE) method. The LSE method states that
out of all possible models, the best model satisfies the following:


\text{Best Model} = \min \left[ \sum_{i=1}^{n} \left( y_i - f(x_i) \right)^2 \right]    (2.24)


Fig. 2.6 shows a set of data points that have been fit to the line shown. Each residual,
or the vertical distance between each point and the line, must be within some
threshold for the point to be considered an inlier. Any points beyond this threshold are
considered outliers and do not fit the model that the data has been fit to.


Figure  2.6:  Least  Squares  line  fitting  example. 


Like other techniques, least squares estimation has its flaws. At times, it can
take as little as one outlier to skew the result, producing an inaccurate representation
of what constitutes an inlier. Fig. 2.7 shows an example of one such
downfall.
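
The sensitivity to a single outlier can be reproduced numerically with a small sketch. The data below are invented for illustration (points on y = 2x + 1, with one value replaced by 40) and are not the data of Fig. 2.7.

```python
def least_squares_line(xs, ys):
    """Closed-form slope and intercept minimizing sum (y_i - f(x_i))^2."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

xs = [0, 1, 2, 3, 4, 5]
clean = [2 * x + 1 for x in xs]                 # exactly on y = 2x + 1
slope_clean, intercept_clean = least_squares_line(xs, clean)

corrupted = clean[:-1] + [40]                   # last y (11) replaced by an outlier
slope_bad, intercept_bad = least_squares_line(xs, corrupted)
```

On the clean data the fit recovers a slope of 2; with the one corrupted point the fitted slope rises above 6, so residual thresholds measured against that line would misclassify most of the true inliers.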


Figure 2.7: Least Squares Estimation failure example [52]. The least
squares line on the left was calculated based on the points shown. The
plot to the right shows the negative effect just one outlier can have on line
fitting of a data set using least squares.
2.4.2 Least Median Squares. Similar to the least squares algorithm, least
median squares is also widely used to model data. The least median squares (LMS)
algorithm simply states: out of all possible models, choose the model that minimizes
the median of the squared residuals, such that


\text{Best Model} = \min \left( M\left( \left[ (y_1 - f(x_1))^2, (y_2 - f(x_2))^2, \ldots, (y_n - f(x_n))^2 \right] \right) \right)    (2.25)

where M is the median of the squared residuals. This method is very accurate even in
the presence of outliers. It outperforms other algorithms, including RANSAC (discussed
later), up to the case in which the outliers account for more than 40% of the dataset [73].
This is known as the breakdown point. This method also fails in situations in which
a majority of the data is positioned on a dominant plane in the image [71]. In other
words, since this algorithm is dependent upon the minimum median of the residuals,
if the majority of those inputs originate from one localized area, it can have an
undesired effect on the fit line.
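
A small sketch of the criterion in Equation (2.25) is shown below. It is an illustration under stated assumptions: candidate lines are generated exhaustively from every pair of points, which is only practical for tiny data sets, and the data are invented (y = 2x + 1 with one gross outlier).

```python
from itertools import combinations
from statistics import median

def least_median_squares_line(xs, ys):
    """Try the line through every pair of points and keep the one whose
    squared residuals have the smallest median."""
    best = None
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        if x1 == x2:
            continue                          # vertical line, skip
        slope = (y2 - y1) / (x2 - x1)
        intercept = y1 - slope * x1
        score = median((y - (slope * x + intercept)) ** 2
                       for x, y in zip(xs, ys))
        if best is None or score < best[0]:
            best = (score, slope, intercept)
    return best[1], best[2]

xs = [0, 1, 2, 3, 4, 5]
ys = [1, 3, 5, 7, 9, 40]                      # y = 2x + 1 with one outlier
slope, intercept = least_median_squares_line(xs, ys)
```

Because the median ignores the single large residual, the selected line keeps the slope and intercept of the underlying model, in contrast to an ordinary least squares fit on the same data.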

2.5 Summary

This chapter has provided a detailed study of the techniques required to conduct
fast image retrieval for vision-based SLAM. First, SLAM filtering methods were
reviewed, presenting detail on the pose estimation process. Then, to answer the data
association problem, nearest neighbor search methods using tree and hash table-based
approaches were evaluated. Finally, to conduct exact match image registration to match
the query based on the nearest neighbors found, common line fitting techniques were
eliminated as possible options.
III. Methodology

This implementation combines the work of previous researchers and tests their
capabilities in a SLAM-like application with concentration on the data association. The
results of the data association testing can be used for vision-SLAM as well as other
localization techniques through image mapping. The general flow is shown in Fig.
3.1. In separate tests, SIFT features and Histograms of Oriented Gradients (HOG)
features are broken into database and query features as determined by the agent’s
path. Each query is associated with k similar neighbors using KLSH. This method
stores each feature in a hash table by transforming points into kernel space prior to
completing hash calculations. Then an exact match amongst the neighbors is found
using RANSAC, in which image planes are aligned by calculating a transformation
matrix using a 2D image homography. Feature matches in each image are then compared
using distance metrics, allowing for the removal of false feature matches, known
as outliers. The output can then be used for localizing agent pose in the mapping
process. This implementation was conducted by running all landmark features through
each method before proceeding to the next, but it can also be done iteratively. This
chapter first presents the background of each method used in the implementation. It
then details how these methods were used to conduct the data association for this
SLAM-like setup.

3.1  Implementation  Background 

This section discusses the background and related work of the concepts implemented
in this research. First, SIFT and HOG features are reviewed for image feature
extraction. To compute the data association, a k-nearest neighbor method known as
KLSH, an extension of LSH, is discussed. Finally, RANSAC is reviewed, which
answers the query using the nearest neighbors provided by KLSH.

3.1.1 Feature Extraction. Using computer vision to detect features requires
strong detection capability in images whose scenes have specific orientation changes
among the objects and vary in scale and translation. Tracking of these features from
image to image requires repeatability of the detector so that features from the same
objects or keypoints are captured. Finally, features that are found in an image need
to be invariant against translation, rotation, and scale changes depending on the
problem. For instance, rotation invariance is required in environments in which
the camera or objects being tracked are moving with changing orientations, such
as images taken from a camera rotating about the horizontal axis. Scale invariance
is required in scenes in which the agent is moving forward or backward through
the environment, causing objects to change in size. This implementation uses an
agent that moves through an indoor, land environment. Therefore, there are minimal
rotation changes throughout the images, just scale and translation.

Figure 3.1: Implementation flow block diagram.

Data association via KLSH is tested on both SIFT and HOG features. SIFT
features have been found to be successful in many aspects of computer vision due to their
scale, rotation, and translation invariance and have been refined numerous times since
originally implemented by Lowe [46]. HOG features, however, are scale invariant and
have mainly been used in low resolution facial and object recognition [12], [55], [20].

3.1.1.1 SIFT. Scale Invariant Feature Transforms (SIFT) have been
used in many SLAM applications such as [9], [13], [35], [57], [67], [74]. First the
image is analyzed in scale space using a difference of Gaussians (DOG) function for
various keypoints where interest points may be located. Local extrema of the DOG
function are computed and assigned as keypoints by comparing each pixel with its
surrounding 4x4 pixel neighbors. If the current pixel is greater or smaller than all
of the surrounding pixels in that neighborhood, it is designated a keypoint. Next, for
each keypoint a Gaussian smoothing function is applied as a function of the scale at
all levels in the image scale space. An orientation assignment is then conducted using
gradient calculations on a specified region around the keypoint, given by:


m(x, y) = \sqrt{ \left( L(x+1, y) - L(x-1, y) \right)^2 + \left( L(x, y+1) - L(x, y-1) \right)^2 }

\theta(x, y) = \tan^{-1} \left( \frac{ L(x, y+1) - L(x, y-1) }{ L(x+1, y) - L(x-1, y) } \right)    (3.1)


Then all of the magnitudes are combined in a histogram, typically 36 bins long,
covering 360° of orientation. Peaks within this histogram are noted as dominant
directions in the gradient. The orientation corresponding to the peak magnitude of
the histogram is assigned as the orientation for the keypoint. Multiple keypoints and
corresponding orientations are defined for all magnitudes that are within 80% of the
peak. This multiple assignment is rare; however, it greatly contributes to the stability
of the feature matching algorithm. Finally, the descriptor assignment is started by
applying a Gaussian blur on the region around each keypoint. Then the region is
divided into 2x2 subregions. The gradient is then calculated on each subregion,
allowing the formation of another orientation histogram, this time with typically 8
or 9 orientation bins depending on the requirements of the system. A descriptor is
formed by concatenating the histograms of all subregions. The final feature is a data
structure that contains information on the feature location, scale, orientation, and the
descriptor vector.
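
The orientation assignment step can be sketched as follows. This is a simplified illustration, not Lowe's full method: it omits the Gaussian weighting of samples, uses `atan2` so that orientations cover the full 360°, and treats the whole toy image as the region around a single keypoint. All names are assumptions for the example.

```python
import math

def gradient(L, x, y):
    """Magnitude and orientation (degrees in [0, 360)) at pixel (x, y) of the
    smoothed image L, indexed as L[y][x], using the differences in Eq. (3.1)."""
    dx = L[y][x + 1] - L[y][x - 1]
    dy = L[y + 1][x] - L[y - 1][x]
    return math.hypot(dx, dy), math.degrees(math.atan2(dy, dx)) % 360.0

def dominant_orientations(L, bins=36, peak_ratio=0.8):
    """Accumulate gradient magnitudes into an orientation histogram and return
    the centers of all bins within `peak_ratio` of the highest bin."""
    width = 360.0 / bins
    hist = [0.0] * bins
    for y in range(1, len(L) - 1):
        for x in range(1, len(L[0]) - 1):
            mag, theta = gradient(L, x, y)
            hist[int(theta // width) % bins] += mag
    peak = max(hist)
    return [(i + 0.5) * width for i, h in enumerate(hist)
            if peak > 0 and h >= peak_ratio * peak]

# A horizontal intensity ramp: every gradient points along +x (0 degrees).
ramp = [[float(x) for x in range(5)] for _ in range(5)]
orientations = dominant_orientations(ramp)
```

For the ramp image all gradient magnitude falls in the first 10° bin, so a single dominant orientation is reported; an image region with two strong edge directions would return two bin centers and hence two keypoint orientations.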


3.1.1.2 HOG. The Histograms of Oriented Gradients (HOG) implementation
used in this paper follows the ideas from [12], [20], [19], [55]. Previously
HOG has primarily been used on small regions of an image, typically for
pedestrian, vehicle and object detection [4], [20], [19], [55]. It has been found to
outperform many other descriptors in its class such as Haar wavelets and AdaBoost
classifiers [19], [55]. The overall process is described in Fig. 3.2. Starting with an
image patch, the goal is to calculate a 128-length descriptor for each patch.

Figure 3.2: HOG Implementation Flow Block Diagram. (Blocks shown: weighted vote
into spatial and orientation cells; contrast normalize by overlapping 2x2 spatial
blocks; collect 128-bit HOG descriptor over image patch.)

First,  the  gradient  of  each  image  was  calculated  resulting  in  an  orientation  and 
magnitude  representation  for  each  pixel  as  described  below. 


mag = \sqrt{ (\nabla_x)^2 + (\nabla_y)^2 }

\theta = \arctan\left( \frac{\nabla_y}{\nabla_x} \right) \quad (\theta \in \Re [0, 180°], \text{unsigned})    (3.2)


This gradient calculation was looked at in two ways. [20] reported that a simple
1D [-1 0 1] mask worked best, but the 3x3 Sobel filter tested in [20] and
primarily used in [55] was also examined for comparison. A smoothing function
was not used; instead, the edges of each image gradient were set to zero to help
eliminate any negative edge effects such as aliasing.


Next, a 3x3 block window was defined. While for this implementation the window
had to be 3x3 to meet the requirements of the vision aided navigation algorithm
being used, there are multiple combinations that work well. See [19] for a survey.
Although the window block size is set, the number of pixels this window covers
can vary. Since in the past this method has only been used on small image
regions, there were no guidelines on how big or small the window
should be. Therefore, the implementation was first tested with the recommendation
by [20] of an 8x8-pixel cell, resulting in a 24x24-pixel window. Each 8x8 cell is
represented as a histogram with 8 orientation bins. As few as four can be used;
however, better performance has been found up to nine bins (20° of separation) [20].
This implementation used eight to meet the descriptor requirements.

Although  just  about  all  feature  extraction  methods  use  gradient  histograms 
to  represent  calculated  features,  this  method  differs  in  its  specific  binning  tactics. 
Instead  of  just  assigning  the  entire  magnitude  to  an  orientation  bin,  bin  assignment  is 
determined  by  a  weighted  vote.  The  two  bins  closest  to  the  pixel  orientation  receive  a 
percentage  of  the  pixel  magnitude  based  on  the  degree  proximity  from  each  bin.  For 
example,  a  pixel  orientation  of  65°  in  a  nine  bin  histogram  would  be  split  between 
bins  three  and  four  or  orientations  centered  at  50°  and  70°  respectively.  The  weighted 
vote  based  on  spatial  distance  would  result  in  75%  of  the  magnitude  being  assigned 
to  bin  four  (70°)  and  25%  assigned  to  bin  three  (50°). 
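
The weighted vote in this example can be checked with a short sketch. The function name and signature are assumptions for illustration; bins are taken to be centered at 10°, 30°, ..., 170° for the nine-bin case, and votes wrap around because the orientation is unsigned.

```python
def weighted_vote(theta, magnitude, bins=9, span=180.0):
    """Split a pixel's gradient magnitude between the two orientation bins
    whose centers are closest to theta (unsigned, in [0, span) degrees)."""
    width = span / bins                    # 20 degrees for nine bins
    hist = [0.0] * bins
    pos = theta / width - 0.5              # position measured in bin centers
    lower = int(pos // 1)                  # bin whose center is at or below theta
    frac = pos - lower                     # share that goes to the upper bin
    hist[lower % bins] += magnitude * (1.0 - frac)
    hist[(lower + 1) % bins] += magnitude * frac
    return hist

hist = weighted_vote(65.0, 1.0)
# hist[2] is bin three (center 50 degrees); hist[3] is bin four (center 70 degrees).
```

For an orientation of 65° the sketch assigns 0.25 of the magnitude to bin three and 0.75 to bin four, matching the split described above.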

Next the histograms of a 2x2 block of four cells are concatenated as a 32-length
vector (4 cell histograms x 8 bins in length = 32). This vector is a partial
descriptor for the patch and will represent the first cell in the window. The vector
is then normalized to unit length by calculating the l_2 norm. This overlapping block
method provides contrast invariance. By repeating this method for the other cells
in the window, four cells are found to have partial descriptors comprised of their
neighboring cells. An example of this is shown in Fig. 3.3. The displacement of
the four copies of numbers 1-4 represents the four partial descriptors that make up
the final patch descriptor. Each underlined number indicates the cell that
represents that respective partial descriptor. Concatenating these four partial 32-length
descriptors results in the 128-length descriptor describing this window as a feature
(32 x 4 partial descriptors = 128).
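
The assembly of the 128-length descriptor can be sketched as below. This is a minimal illustration: the function name is hypothetical, the cell histograms are dummy values, and the block order (top-left, top-right, bottom-left, bottom-right) is an assumption about the numbering in Fig. 3.3.

```python
import math

def hog_descriptor(cell_hists):
    """cell_hists: 3x3 grid (list of rows) of 8-bin cell histograms.
    Returns the 128-length descriptor built from four overlapping 2x2 blocks."""
    descriptor = []
    for top, left in [(0, 0), (0, 1), (1, 0), (1, 1)]:    # blocks 1-4 (assumed order)
        block = []
        for r in (top, top + 1):
            for c in (left, left + 1):
                block.extend(cell_hists[r][c])            # 4 cells x 8 bins = 32
        norm = math.sqrt(sum(v * v for v in block))       # l_2 norm of the block
        if norm > 0:
            block = [v / norm for v in block]
        descriptor.extend(block)
    return descriptor

# Dummy 3x3 grid of 8-bin histograms, used only to exercise the function.
cells = [[[float(r + c + b) for b in range(8)] for c in range(3)]
         for r in range(3)]
desc = hog_descriptor(cells)
```

Each 32-length partial descriptor is normalized on its own before concatenation, so the center cell contributes to all four blocks with a different normalization each time; that overlap is what provides the contrast invariance mentioned above.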

As this window slides across every pixel throughout the image, thousands of
feature descriptors are calculated. Since not all of these descriptors are
valid features, only those that meet certain standards or thresholds were
kept. In any feature extraction method, one indication of a good feature is
that it has distinct contrast transitions [46]. This is mathematically indicated by
gradient histograms that have magnitude representation across multiple orientation
bins. This is the first threshold method. Next, by eliminating those descriptors that
do not have peak magnitudes of at least some predetermined percentage greater than
the average, the number of features is greatly reduced to a more accurate, not to
mention manageable, number.

Figure 3.3: HOG Concatenation Assignment. The 128-length descriptor is
made up of a series of histogram concatenations. First, four 2x2 overlapping
blocks are formed from each 3x3 patch. Starting with the top-left 2x2 block,
labeled 1, the 8-bin histograms from each of the 4 cells are concatenated,
forming a 32-length vector. Repeating the process for the 2x2 blocks
labeled 2-4 and concatenating those vectors in block 1-4 order yields the
128-length descriptor that describes the 3x3 block window.

3.1.2 Kernelized Locality-Sensitive Hashing. Kernelized Locality-Sensitive Hashing
(KLSH) is similar to standard LSH in that it computes hash functions using random
projections; however, all computations are completed in kernel space using only a
portion of the data. Each database and query point is transformed to kernel space first;
therefore, all statistics of the projection functions are unknown and must be estimated.
Thus far, KLSH has been used to conduct nearest neighbor search on several image
databases of both small and large dimension [42].

Just as in LSH, KLSH starts with defining the hash functions. This time, however, the
database and query collision probabilities are characterized by the amount of similarity
between the two points, as described by [15].

KLSH Definition. A locality-sensitive hashing scheme is a distribution on a family
$\mathcal{H}$ of hash functions operating on a collection of objects, such that for two
objects $p$, $q$,

$$\Pr[h(p) = h(q)] = \mathrm{sim}(p, q). \tag{3.3}$$

In other words, the probability of collision between two points is equal to the
similarity between them.

$\mathrm{sim}(p, q)$ in this definition is the similarity function, and $h$ is a hash
function selected at random from $\mathcal{H}$. The intuition behind this is that when
$h(p) = h(q)$, $p$ and $q$ collide in the hash table, that is, they are assigned to the
same hash. Therefore, points that are highly similar have a higher probability of being
stored together in the hash table, resulting in collision [43]. This definition of LSH is
slightly different from the one given previously in the LSH discussion as defined by
[31]; however, the concept is the same. This implementation uses the definition as used
by [42].

Describing the similarity function in terms of the inner product yields:

$$\mathrm{sim}(p, q) = p^T q \tag{3.4}$$

[15] then expanded on this function, defining the locality-sensitive hash function from
Equation (3.3) as

$$h_r(p) = \begin{cases} 1, & \text{if } r^T p > 0 \\ 0, & \text{otherwise} \end{cases} \tag{3.5}$$

where $r$ is a random hyperplane drawn from a zero-mean multivariate Gaussian
$\mathcal{N}(0, \Sigma_p)$ with the same dimensionality as the input vector $p$. This
implies that each hash function is built solely from the statistics of the input. The
proof showing that this function obeys the locality-sensitive property is detailed in
[32]. Since one hash equates to the sign of one projection on a point $p$, a series of
hashes is formed simply by repeating this process $t$ times. This forms one hash bucket
such that

$$g(p) = \left[ h_1(p), \ldots, h_l(p), \ldots, h_t(p) \right]. \tag{3.6}$$

Figure 3.4: KLSH Projection Example. The projection of a random vector, $r$, on a point,
$p$, across dimension $d_t$ equates to the spatial position of the coordinates in $d_t$
with respect to $r$.
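The construction in Equations (3.5) and (3.6) can be sketched as follows. This is a
minimal illustration of plain random-hyperplane hashing (not yet the kernelized variant),
assuming $\Sigma_p$ is estimated as the sample covariance of the input points. The
function names `make_hyperplanes` and `hash_points` are hypothetical.

```python
import numpy as np

def make_hyperplanes(points, n_hashes, rng):
    """Draw n_hashes random hyperplanes r ~ N(0, Sigma_p) from the data statistics."""
    d = points.shape[1]
    sigma_p = np.cov(points, rowvar=False)          # (d, d) sample covariance
    return rng.multivariate_normal(np.zeros(d), sigma_p, size=n_hashes)

def hash_points(points, hyperplanes):
    """Bit l of a point is 1 if r_l^T p > 0 and 0 otherwise (Equation 3.5).

    Returns an array of shape (n_points, n_hashes); each row is the concatenated
    bit sequence for one point (Equation 3.6).
    """
    return (points @ hyperplanes.T > 0).astype(np.uint8)
```

Since each bit depends only on the sign of the projection, a point and any positively
scaled copy of it receive the same bit sequence, while negating the point flips every
bit.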

To understand the computation of one projection, consider the following example.
Fig. 3.4 shows a set of data points in the input space. As in Equation (3.5), $r$ is
formed from the statistics of $p$. A random multivariate Gaussian matrix, $\mathcal{R}$,
distributed as $\mathcal{N}(0, \Sigma_p)$ is formed. At random, a dimension in $p$ is
chosen to project across. $r_1(p)$, as shown, projects across dimension 2. Points to the
right of this line ($r^T p > 0$) are assigned a bit of 1, while points to the left
($r^T p < 0$) are assigned a bit of 0. Repeating this process $t$ times yields the
hashing scheme in column 1 of Table 3.1. All bits pertaining to each individual point are
concatenated to form the length-$t$ hash sequence for each bucket. Given $b = 5$, the
table shows each hash location in the bucket.

Table 3.1: KLSH Hashing Scheme. The hash location for a particular point, $p$, is equal
to the concatenated bits in the corresponding column. The hash bucket contains the hashes
for each point.

$$g(p) = \left\{ \begin{array}{cccccc}
h_{1,1r}(p) & h_{2,1r}(p) & \cdots & h_{l,1r}(p) & \cdots & h_{t,1r}(p) \\
h_{1,jr}(p) & h_{2,jr}(p) & \cdots & h_{l,jr}(p) & \cdots & h_{t,jr}(p) \\
h_{1,br}(p) & h_{2,br}(p) & \cdots & h_{l,br}(p) & \cdots & h_{t,br}(p)
\end{array} \right. \tag{3.7}$$

for $1 \le j \le b$, $1 \le l \le t$. In KLSH, all computation is done in kernel space
such that the similarity function in Equation (3.4) extends to

$$\mathrm{sim}(x_i, x_j) = \kappa(x_i, x_j) = \phi(x_i)^T \phi(x_j). \tag{3.8}$$

In this formulation, $\kappa(x_i, x_j)$ is the mapping of points $x_i$ and $x_j$ to kernel
space using a predefined kernel function. $\phi(\cdot)$, therefore, is a compilation of
the random hyperplane projection hash functions from $\mathcal{H}$ using the method just
discussed. The problem is that nothing is known about the data while in kernel space to
generate $r$ from. Therefore, in order to construct the hash function, $r$ needs to be
constructed such that $r^T \phi(x)$ can be computed directly from the kernel function.
Like the standard $r$, it should be approximately Gaussian but allow the function as a
whole to be computed using only the kernels. This is accomplished by constructing $r$ as
a weighted sum of a subset of the database input.

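From the central limit theorem discussed in Appendix A.2, consider each kernel data
sample, $\phi(x)$, as a vector from some distribution $\mathcal{D}$ with mean $\mu$ and
covariance $\Sigma$. If $t$ database objects are chosen i.i.d. from $\mathcal{D}$,
forming the set $S$, then a random variable $z_t$ can be defined as

$$z_t = \frac{1}{t} \sum_{i \in S} \phi(x_i). \tag{3.9}$$

As $t$ grows large, the central limit theorem states that the subsequent random vector
$\tilde{z}_t$, defined by

$$\tilde{z}_t = \sqrt{t}\,(z_t - \mu), \tag{3.10}$$

is distributed as a multivariate Gaussian $\mathcal{N}(0, \Sigma)$. Then the whitening
transform is applied to yield

$$\Sigma^{-1/2} \tilde{z}_t \sim \mathcal{N}(0, I). \tag{3.11}$$

The hash function then becomes

$$h(\phi(x)) = \begin{cases} 1, & \text{if } \phi(x)^T \Sigma^{-1/2} \tilde{z}_t > 0 \\ 0, & \text{otherwise} \end{cases} \tag{3.12}$$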


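As mentioned previously, since the original data is being represented by the kernel
function, statistics such as $\Sigma$ are unknown and therefore must be estimated via
sampling. Selecting $n$ objects from the database enables calculation of the mean and
covariance by eigendecomposition and kernel principal component analysis methods via
[62], in which $\Sigma = V \Lambda V^T$. Therefore

$$\Sigma^{-1/2} = V \Lambda^{-1/2} V^T. \tag{3.13}$$

Define the new hash function as

$$h(\phi(x)) = \mathrm{sign}\left( \phi(x)^T V \Lambda^{-1/2} V^T \tilde{z}_t \right). \tag{3.14}$$

This denotes the hash process for a vector of kernel entries. Now, consider the case in
which there is a matrix of kernel inputs, $K$. Its eigendecomposition, $K = U \Theta U^T$,
makes this an expanded form of the single kernel input. The non-zero eigenvalues of
$\Lambda$ from Equation (3.13) are the same as the non-zero eigenvalues of $\Theta$. Let
$v_m$ be the $m$-th eigenvector of the covariance matrix and $u_m$ be the $m$-th
eigenvector of the kernel matrix. From [62], compute the projection

$$v_m^T \phi(x) = \sum_{i=1}^{n} \frac{1}{\sqrt{\theta_m}}\, u_m(i)\, \phi(x_i)^T \phi(x), \tag{3.15}$$

in which the $\phi(x_i)$ terms are the $n$ data point samples mentioned previously.
Performing this computation over all $m$ eigenvectors results in

$$h(\phi(x)) = \phi(x)^T V \Lambda^{-1/2} V^T \tilde{z}_t = \sum_{m=1}^{n} \frac{1}{\sqrt{\theta_m}}\, v_m^T \phi(x)\, v_m^T \tilde{z}_t. \tag{3.16}$$

Substituting Equation (3.15) yields

$$\sum_{m=1}^{n} \frac{1}{\sqrt{\theta_m}} \left( \sum_{i=1}^{n} \frac{1}{\sqrt{\theta_m}}\, u_m(i)\, \phi(x_i)^T \phi(x) \right) \left( \sum_{j=1}^{n} \frac{1}{\sqrt{\theta_m}}\, u_m(j)\, \phi(x_j)^T \tilde{z}_t \right). \tag{3.17}$$

Simplifying yields

$$\sum_{i=1}^{n} \sum_{j=1}^{n} \left( \phi(x_i)^T \phi(x) \right) \left( \phi(x_j)^T \tilde{z}_t \right) \left( \sum_{m=1}^{n} \frac{1}{\theta_m^{3/2}}\, u_m(i)\, u_m(j) \right). \tag{3.18}$$

Since $K^{-3/2}_{ij} = \sum_{k=1}^{n} \frac{1}{\theta_k^{3/2}}\, u_k(i)\, u_k(j)$, further
simplification yields

$$h(\phi(x)) = \sum_{i=1}^{n} w(i) \left( \phi(x_i)^T \phi(x) \right), \tag{3.19}$$

where $w(i) = \sum_{j=1}^{n} K^{-3/2}_{ij}\, \phi(x_j)^T \tilde{z}_t$.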

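The Gaussian random vector $r$ becomes $r = \sum_{i=1}^{n} w(i)\, \phi(x_i)$. This is
therefore a weighted sum over the vector inputs from the $n$ database kernel entries.
Substituting the random vector of kernels, $z_t = \frac{1}{t} \sum_{i \in S} \phi(x_i)$
from Equation (3.9), together with the sample estimate of the mean,
$\mu = \frac{1}{n} \sum_{k=1}^{n} \phi(x_k)$, in $w(i)$ yields

$$w(i) = \frac{1}{t} \sum_{j=1}^{n} \sum_{l \in S} K^{-3/2}_{ij} K_{jl} - \frac{1}{n} \sum_{j=1}^{n} \sum_{k=1}^{n} K^{-3/2}_{ij} K_{jk}. \tag{3.20}$$

Since the $\sqrt{t}$ term has no effect on the sign of the hash function, it can be
ignored. Final simplification for $w(i)$ is as follows:

1. Define $e$ to be a vector of all ones and $e_S$ to be a vector with ones in the entries
corresponding to the indices of $S$. The resulting $w$ is

$$w = K^{-1/2} \left( \frac{1}{t} e_S - \frac{1}{n} e \right). \tag{3.21}$$

2. Therefore, the final hash function for a kernel input is

$$h(\phi(x)) = \mathrm{sign}\left( \sum_{i=1}^{n} w(i)\, \kappa(x, x_i) \right), \tag{3.22}$$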


in which $\kappa$ is the mapping of points $x$ and $x_i$ to kernel space. Similar to the
hashing discussion at the beginning of this section, each hash will concatenate multiple
iterations forming a bucket.

KLSH Algorithm

1. Pre-processing

(a) Form kernel matrix, $K$, over $n$ data points from the database

(b) Form $e_S$ by selecting $t$ indices at random from the set $\{1, \ldots, n\}$

(c) Each $h_l(\phi(x))$ performs a projection on indices

(d) Form $w = K^{-1/2} \left( \frac{1}{t} e_S - \frac{1}{n} e \right)$

(e) Project the vector $w$ onto the point in kernel space in the same manner as the
example

i. Form hash bucket $g_j(x)$ by assigning bits accordingly

2. Query Process

(a) Form the same L hash buckets per Equation (3.23) as done for the database points

(b) Use nearest neighbor techniques to answer query


Figure 3.5: KLSH Algorithm.


$$g_j(x) = \left[ h_{1,j}(\phi(x)) \;\; h_{2,j}(\phi(x)) \;\; \cdots \;\; h_{l,j}(\phi(x)) \;\; \cdots \;\; h_{t,j}(\phi(x)) \right], \quad (1 \le l \le t),\ (1 \le j \le b) \tag{3.23}$$


To determine the best parameter set, this implementation tests accuracy of query matches
throughout multiple iterations, each time varying the number of points in the database,
$n$, the number of hash functions per bucket, $b$, as well as the number of feature
points per image, $t$, in each sample to estimate statistics from. The final algorithm
detailing this process is in Fig. 3.5.
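The pre-processing and hashing steps of Fig. 3.5 can be sketched as below. This is an
illustration under stated assumptions rather than the implementation evaluated in this
work: a Gaussian RBF kernel stands in for the unspecified kernel function, the kernel
matrix is not centered, $K^{-1/2}$ is formed from the eigendecomposition of $K$ with
near-zero eigenvalues discarded, and the names `rbf_kernel`, `klsh_weights`, and
`klsh_hash` are hypothetical.

```python
import numpy as np

def rbf_kernel(A, B, gamma=0.5):
    """Gaussian RBF kernel between rows of A and rows of B (assumed kernel choice)."""
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def klsh_weights(K, t, n_bits, rng):
    """One weight vector w = K^{-1/2} (e_S / t - e / n) per hash bit (Equation 3.21)."""
    n = K.shape[0]
    theta, U = np.linalg.eigh(K)                      # K = U diag(theta) U^T
    keep = theta > 1e-8 * theta.max()                 # drop near-zero eigenvalues
    K_inv_sqrt = (U[:, keep] / np.sqrt(theta[keep])) @ U[:, keep].T
    e = np.ones(n)
    W = np.empty((n_bits, n))
    for l in range(n_bits):
        e_S = np.zeros(n)
        e_S[rng.choice(n, size=t, replace=False)] = 1.0   # t random sample indices
        W[l] = K_inv_sqrt @ (e_S / t - e / n)
    return W

def klsh_hash(W, K_query):
    """Bits from the sign of sum_i w(i) * kappa(x, x_i) (Equation 3.22).

    K_query has shape (n, m): kernel values between the n sampled database
    points and the m points to be hashed. Returns an (n_bits, m) bit array.
    """
    return (W @ K_query > 0).astype(np.uint8)
```

A database point is hashed by passing its column of the kernel matrix, and a query point
by passing its kernel values against the same $n$ samples, so both sides use identical
weight vectors.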


3.1.3 RANSAC. Given two images of the same scene related by a translation, the features
in each should be the same; however, their locations will be different. In selecting
matching features between images, the goal is to eliminate the outliers, or those matches
which mathematically appear to be the same according to distance metrics but visually are
not, as shown in red in Fig. 3.6.

To handle this task, this implementation uses the RANdom SAmple Consensus (RANSAC)
technique. In the past, RANSAC has been used for tasks such as model parameter estimation
in the presence of outliers [29]. It has recently been used for data association with
vision-only SLAM using an EKF [17] as well as for agent localization in [63], [30], [59],
and [64]. One reason RANSAC is favored over other techniques is its ability to give
accurate results even in datasets in which more than 50% of the set are outliers. This
percentage is known as the breakdown point. As mentioned previously, the LMS technique
falls prey to this limit, as does least squares estimation [80]. RANSAC is broken into
two main steps, hypothesis and test.

Figure 3.6: Identifying inliers and outliers during feature matching between 2
translated image frames. This is an arbitrary example in which each blue circle or
diamond represents any feature type. The blue lines represent correctly matched features,
or inliers. The red lines represent feature mismatches, or outliers.


In the hypothesis step, a minimal sample set (MSS) of the data is randomly selected.
RANSAC differs from other algorithms in that this sample set is as small as possible
while still being sufficient to determine the model, as opposed to using the entire
dataset as in least squares and least median of squares.

The test step simply tests how many samples of the entire dataset fit the model. The set
of samples that meet this criterion is known as the consensus set (CS), in which larger
sets are considered better. These steps are repeated until the probability of finding a
better CS falls below a predetermined threshold. Another termination method is to use the
best CS out of a specific number of iterations. Each method has a disadvantage. Repeating
the algorithm until it is probabilistically complete places no cap on computation time.
Conversely, while the probability of a good model increases with the number of
iterations, there is no guarantee that the optimal solution has been selected when
bounding the number of iterations [80]. As this process is random, it can end in
different results from one run to the next; however, stronger features and many
iterations minimize this occurrence.

3.1.3.1 Calculating the Algorithm Termination Point. As just mentioned, the hypothesis
and test steps are run iteratively until the probability of finding a better CS falls
below a predetermined threshold or the predetermined number of iterations has been
reached. Another method is to combine the two and calculate the number of iterations
required to reach a given probability threshold. Let $p_{MSS}$ be the probability of
sampling from the dataset $\mathcal{D}$ an MSS $s$ that produces an accurate estimate of
the model parameters. The probability of producing an MSS with at least one outlier is
then $1 - p_{MSS}$. Consider the rare situation in which the $T$ MSSs produced were all
contaminated by outliers, which occurs with probability $(1 - p_{MSS})^T$. Since this
probability decreases to 0 as $T$ increases, the solution is to choose the number of
iterations, $T$, such that

$$(1 - p_{MSS})^T \le \varepsilon, \tag{3.24}$$

where $\varepsilon$ is a probability threshold. Solving for $T$ and using the ceiling
function, $\lceil x \rceil$, yields

$$T = \left\lceil \frac{\log \varepsilon}{\log(1 - p_{MSS})} \right\rceil. \tag{3.25}$$

Essentially, $T$ is the smallest integer number of iterations that satisfies
Equation (3.24).
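Equation (3.25) can be evaluated directly, as in the short sketch below. The function
name `ransac_iterations` is hypothetical, and the input checks are an added assumption
to keep the logarithms well defined.

```python
import math

def ransac_iterations(p_mss, eps):
    """Smallest T with (1 - p_mss)**T <= eps, per Equation (3.25).

    p_mss: probability that one minimal sample set is free of outliers.
    eps:   acceptable probability that all T sample sets are contaminated.
    """
    if not (0.0 < p_mss < 1.0 and 0.0 < eps < 1.0):
        raise ValueError("p_mss and eps must lie strictly between 0 and 1")
    return math.ceil(math.log(eps) / math.log(1.0 - p_mss))
```

For instance, with $p_{MSS} = 0.5$ and $\varepsilon = 0.01$ the function returns 7,
since $0.5^7 \approx 0.0078$ meets the threshold while $0.5^6 \approx 0.0156$ does not.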


3.1.3.2 Inlier/Outlier Threshold. Once the model is chosen and the points are being
compared to determine inliers, the selection is based on a distance threshold. The
selection of this threshold is key to the accuracy of the result. Setting the threshold
too high will result in bad models being ranked equally with good ones, as shown in
Fig. 3.7a. Fig. 3.7b shows that setting the threshold too low will yield the opposite, in
that none of the points will fit the model.




Figure  3.7:  RANSAC  distance  threshold  effects.  Example  of  a  threshold 
set  too  high  (a).  Example  of  a  threshold  set  too  low  (b)  [72]. 


3.1.4 Integration into Data Association Implementation. For each image, $P_0$ and
$P_{trans}$, $N \times d$ feature sets are retrieved, in which $N$ is the number of
features captured while $d$ is the number of dimensions, 128. First, the features in each
image are matched by comparing all features with each other and finding the minimum
$\ell_2$ (Euclidean) distance. The locations of the feature matches whose distances are
below a threshold are stored and become the dataset that RANSAC analyzes. Locations of
features from image $P_0$ are stored in $p_0$ and locations from $P_{trans}$ in
$p_{trans}$.
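The exhaustive matching step can be sketched as follows. This is an illustration only:
it assumes each image's descriptors are stored as an `(N, 128)` array and keeps, for
every feature in the first image, its nearest neighbor in the second image if the
Euclidean distance is below a threshold. The function name `match_features` and the
parameter `max_dist` are hypothetical.

```python
import numpy as np

def match_features(desc0, desc1, max_dist):
    """Return index pairs (i, j): desc1[j] is the nearest neighbor of desc0[i].

    Only pairs whose l2 distance is below max_dist are kept.
    """
    # Pairwise squared distances via ||a||^2 + ||b||^2 - 2 a.b
    d2 = ((desc0 ** 2).sum(1)[:, None]
          + (desc1 ** 2).sum(1)[None, :]
          - 2.0 * desc0 @ desc1.T)
    dist = np.sqrt(np.maximum(d2, 0.0))       # clip tiny negative rounding errors
    nearest = dist.argmin(axis=1)
    keep = dist[np.arange(len(desc0)), nearest] < max_dist
    return np.column_stack([np.nonzero(keep)[0], nearest[keep]])
```

The pixel locations of the returned index pairs would then form the point sets passed to
RANSAC.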

As RANSAC is a random process, the MSS is determined by selecting $n$ matching point
samples at random, with replacement between iterations. In order to represent the
correspondence between the two images, a homography must be estimated. See Appendix A.3
for a more detailed discussion than the derivations that follow. The 2D version is
represented by a 3 x 3 matrix $H$. This transformation matrix maps the strong features,
$p_0$, in the first image to the same plane as the corresponding features, $p_{trans}$,
in the second image such that

$$p_0' = H p_0 \tag{3.26}$$

in which $p_0'$ represents the feature locations of $p_0$ in the same plane as
$p_{trans}$.

To solve this transformation model, the entries of $H$ are derived from the linear
system $Ah = 0$, which is computed as discussed in Section A.3. $A$ is a $2n \times 9$
matrix in which $n$ is the number of points used for the model. The system is solved by
performing singular value decomposition on $A$, where $H$ is derived from the entries
$v_i$, $(1 \le i \le 9)$, of the right singular vector corresponding to the smallest
singular value of $A$ (equivalently, the eigenvector of $A^T A$ with the smallest
eigenvalue).

$$H = \begin{bmatrix} v_1 & v_2 & v_3 \\ v_4 & v_5 & v_6 \\ v_7 & v_8 & v_9 \end{bmatrix} \tag{3.27}$$


This essentially eliminates the translation between the two images, allowing for
feature-to-feature comparison among the feature matches using Euclidean distance. Those
feature matches whose distances are below a predetermined threshold are considered
inliers and make up the CS. This process is repeated $R$ times, noting the number of
inliers in each iteration. The model that produces the largest consensus set is
considered the best choice. The value of $R$ depends on how many iterations it takes for
the probability of a better inlier result occurring to become sufficiently small; this
probability gets smaller and smaller as $R$ increases. This implementation evaluates this
iteration parameter across many values, testing each time for the number of iterations
that minimizes the overall count yet meets a predetermined threshold of accuracy.
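The hypothesis and test loop described above can be sketched as below. This is a minimal
illustration, not the implementation evaluated here: it assumes four correspondences per
minimal sample set, the standard direct linear transform (DLT) arrangement of the rows of
$A$, and a fixed iteration count. The names `fit_homography`, `project`, and
`ransac_homography` are hypothetical.

```python
import numpy as np

def fit_homography(src, dst):
    """Solve A h = 0 for the homography mapping src points onto dst points."""
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    A = np.asarray(rows, dtype=float)          # shape (2n, 9)
    _, _, Vt = np.linalg.svd(A)
    # Right singular vector for the smallest singular value, reshaped row by row.
    return Vt[-1].reshape(3, 3)

def project(H, pts):
    """Apply H to 2D points using homogeneous coordinates."""
    ph = np.hstack([pts, np.ones((len(pts), 1))]) @ H.T
    return ph[:, :2] / ph[:, 2:3]

def ransac_homography(src, dst, n_iter, thresh, rng, n_sample=4):
    """Keep the model with the largest consensus set over n_iter random samples."""
    best_H = None
    best_inliers = np.zeros(len(src), dtype=bool)
    for _ in range(n_iter):
        idx = rng.choice(len(src), size=n_sample, replace=False)   # hypothesis
        H = fit_homography(src[idx], dst[idx])
        with np.errstate(divide="ignore", invalid="ignore"):
            err = np.linalg.norm(project(H, src) - dst, axis=1)    # test
        inliers = err < thresh
        if inliers.sum() > best_inliers.sum():
            best_H, best_inliers = H, inliers
    return best_H, best_inliers
```

Here `src` and `dst` would be the matched feature locations $p_0$ and $p_{trans}$, and
the number of `True` entries in the returned mask is the size of the consensus set.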


3.2  Computing  the  Data  Association 

Each input to the KLSH algorithm includes the features of the query image along with the
features of all the images being queried according to an orientation heuristic. During
preprocessing, the agent orientation at the time of each observation is retrieved from an
internal navigation system known as FV-SIFT. This algorithm separately tracks image
features to update a pose estimated from the IMU, as shown in [76]. The agent orientation
at each time step at which an image was taken is recorded. Those database images whose
orientations match that of the query are input to KLSH. For each query image feature,
$m$ image identifiers are returned. The top $k$ identifiers whose feature points in
$\mathbb{R}^{128}$ space are closest to the query are output as neighboring matches.

Just as there is uncertainty in the pose at each landmark, there is uncertainty in the
data association as well. This can be modeled as follows. For every query feature, at
most $k$ nearest neighbors will be returned. The accuracy of the output depends on the
number of features in each image used for analysis. An increase in the number of features
increases query match accuracy but also increases the complexity, memory requirements,
and runtime. This implementation ran multiple iterations on each dataset, varying the
number of features kept. The following is an example of the nearest neighbor matching
process. Fig. 3.8 below shows a sample path through an environment. As shown, each marker
represents a landmark at which an image was taken.




Figure 3.8: Data Association Computation Example: Sample Path. L1-L12 are landmarks, or
points in the environment at which an image was taken. After completing the first loop
and returning to L1, query image Q1 is taken, followed by Q2-Q12, assuming the path was
repeated.

As the agent repeats the loop and reappears at landmark L1, the following data
association example begins. At this point, the agent knows that it is at a location that
it has seen before, but does not know which landmark that is. This first point of
reappearance is set as query Q1. In order to determine this location relative to landmark
position, the image features for query image Q1 as well as the features for each landmark
L1-L12 are input to the nearest neighbor KLSH algorithm. Tables 3.2-3.5 represent the
$k$-NN search process via KLSH with parameters $k = 4$ and 10 features. Fig. 3.9 shows an
illustration of how the features from each image are identified by the image ID (IID). In
this arbitrary example, landmarks L1-L12, mentioned earlier, are represented as IIDs
100-111 respectively.


Figure 3.9: Data Association Computation Example: KLSH input. This arbitrary example
illustrates the database feature identification and association to an image ID. IIDs
100-111 correspond with landmarks L1-L12. These, along with the query features for a
particular query landmark, are sent as inputs to the KLSH algorithm. The results for
$k = 4$ are shown in Tables 3.2-3.5. Note: The results were arbitrarily chosen to
represent process flow and were not actually calculated.


The outputs are the $k$ nearest neighbors that are most similar to the input query, as
shown in Table 3.2. Each query feature (QF) matches to $k = 4$ neighboring landmark
database features (DF). The image ID for each DF is also noted.


Table 3.2: Data Association Computation Example: KLSH output. The outputs are the $k$
nearest neighbors for each feature.


Query Image 1: Matching Feat# / Source Image ID

Query Feature #   NN1        NN2        NN3        NN4
QF1               DF01/100   DF02/100   DF24/102   DF10/101
QF2               DF13/101   DF05/100   DF59/105   DF04/100
QF3               DF02/100   DF21/102   DF31/103   DF42/104
QF4               DF04/100   DF05/100   DF09/100   DF17/101
QF5               DF49/104   DF06/100   DF36/103   DF26/102
QF6               DF07/100   DF03/100   DF08/100   DF08/100
QF7               DF15/101   DF03/100   DF54/105   DF46/104
QF8               DF06/100   DF22/102   DF04/100   DF55/105
QF9               DF01/100   DF31/103   DF29/102   DF49/104
QF10              DF01/100   DF02/100   DF51/101   DF16/101

Next, in Table 3.3, the number of occurrences of each image ID for each feature match is
tallied. These tallies are then normalized with respect to each feature in Table 3.4 to
generate a weight to identify the final nearest neighbors.


Table 3.3: Data Association Computation Example: Identify image representation.

Query Image 1: Source Image IDs

Query Feature #   100   101   102   103   104   105   106   107   108   109   110   111
QF1                 2     1     1     0     0     0     0     0     0     0     0     0
QF2                 2     1     0     0     0     1     0     0     0     0     0     0
QF3                 1     0     1     1     1     0     0     0     0     0     0     0
QF4                 3     1     0     0     0     0     0     0     0     0     0     0
QF5                 1     0     1     1     1     0     0     0     0     0     0     0
QF6                 4     0     0     0     0     0     0     0     0     0     0     0
QF7                 1     1     0     0     1     1     0     0     0     0     0     0
QF8                 2     0     1     0     0     1     0     0     0     0     0     0
QF9                 1     0     1     1     1     0     0     0     0     0     0     0
QF10                2     2     0     0     0     0     0     0     0     0     0     0

Table 3.4: Data Association Computation Example: Normalize image representation across
all images.

Query Image 1: Source Image IDs

Query Feature #   100    101    102    103    104    105    106   107   108   109   110   111
QF1               0.5    0.25   0.25   0      0      0      0     0     0     0     0     0
QF2               0.5    0.25   0      0      0      0.25   0     0     0     0     0     0
QF3               0.25   0      0.25   0.25   0.25   0      0     0     0     0     0     0
QF4               0.75   0.25   0      0      0      0      0     0     0     0     0     0
QF5               0.25   0      0.25   0.25   0.25   0      0     0     0     0     0     0
QF6               1      0      0      0      0      0      0     0     0     0     0     0
QF7               0.25   0.25   0      0      0.25   0.25   0     0     0     0     0     0
QF8               0.5    0      0.25   0      0      0.25   0     0     0     0     0     0
QF9               0.25   0      0.25   0.25   0.25   0      0     0     0     0     0     0
QF10              0.5    0.5    0      0      0      0      0     0     0     0     0     0

Table 3.5: Data Association Computation Example: Query $k$-NNs. These are the $k$-NNs to
the query. This is the input to RANSAC.

                  Source Image IDs
                  100    101    102    103
Query Image 1     4.75   1.5    1.25   0.75

Finally, Table 3.5 shows the $k$ nearest neighboring image IDs that have the most
representation, as shown by their weightings, across all of the query features. These
$k = 4$ images are analyzed further using RANSAC to find an exact match to the query.
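The tally, normalization, and ranking steps of Tables 3.3-3.5 can be sketched as
follows, using the source image IDs listed in Table 3.2 as input. The function name
`rank_images` and the tie-breaking rule (lower image ID first) are hypothetical. Note
that with the Table 3.2 entries as printed, image 104 accumulates a weight of 1.0 across
QF3, QF5, QF7, and QF9, so this sketch ranks it fourth, ahead of image 103 (0.75);
Fig. 3.9 states that the tabulated results were chosen arbitrarily rather than
calculated.

```python
from collections import Counter, defaultdict

# Source image IDs of the four nearest neighbors of QF1..QF10 (from Table 3.2).
TABLE_3_2 = [
    [100, 100, 102, 101], [101, 100, 105, 100], [100, 102, 103, 104],
    [100, 100, 100, 101], [104, 100, 103, 102], [100, 100, 100, 100],
    [101, 100, 105, 104], [100, 102, 100, 105], [100, 103, 102, 104],
    [100, 100, 101, 101],
]

def rank_images(matches, k):
    """Sum per-feature normalized vote counts and return the top-k (image ID, weight)."""
    weights = defaultdict(float)
    for image_ids in matches:
        counts = Counter(image_ids)              # tally, as in Table 3.3
        for image_id, count in counts.items():
            # normalize by the number of neighbors, as in Table 3.4
            weights[image_id] += count / len(image_ids)
    # Highest weight first; ties broken by the lower image ID (an assumption).
    return sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))[:k]
```

Calling `rank_images(TABLE_3_2, k=4)` reproduces the first three weights of Table 3.5
(4.75, 1.5, and 1.25 for images 100, 101, and 102).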


3.2.1 An Exact Match From k-NNs. To complete the data association problem, the top $k$
nearest neighbors from the KLSH output, as shown in Table 3.5, are retrieved and analyzed
for an exact match using RANSAC. The original query image is compared with each possible
match, yielding $k$ different inlier results. Recall that the inliers are those feature
matches between images that meet the distance threshold under a good transformation
model. After many iterations, the consensus set with the most inliers is selected as the
most probable match for the query.

To determine the number of iterations, this algorithm followed both methods discussed.
An arbitrary number of iterations was chosen and justified by the accuracy of the match.
The output is the time stamp of the image that has the most feature matches between the
two image spaces.

3.3  Summary 

This section has reviewed the background research behind all of the methods used to
compute the data association and explained how they are implemented to solve the data
association problem in vision-based SLAM. Once features have been extracted through
either SIFT or HOG feature detection, hash-based methods via KLSH are used to narrow down
the search space in matching a single query feature to an entire database of features.
$k$-nearest neighbor search methods result in each query image being matched with the
$k$ most similar stored database images. The image IDs for each image are then matched
with the query side by side using homography-based RANSAC. By correctly modeling the
matching features in each image to the system, false matches can be detected, resulting
in the identification of the correct image through standard distance techniques. The
matches found answer the data association for each instance in an agent's movement
throughout the environment. The associations can be extended for use in vision-based SLAM
and other image mapping algorithms.

IV. Results & Analysis

This chapter discusses the specific tests conducted using KLSH to answer the data
association problem when determining agent pose in an environment. First, the testing
details are discussed, including vehicles, test grounds, and parameters used. Next, the
results of using KLSH are presented, and its robustness as a viable option for conducting
data association is assessed.

4.1  Testing  Procedure 

4.1.1 Terrain. All tests were completed indoors in a hallway environment. The overall
path, shown in blue in Fig. 4.1, was rectangular in shape with a perimeter of
approximately 210 m. The hallway had tile flooring and was 2.5 m wide. The agent started
at the top left blue arrow and traversed the path twice in the counterclockwise
direction, producing just over 1260 images. From this path two smaller datasets were
derived. The path in red represents the traversing of approximately 140 m on 3 of the
straightaways. Two legs of this path produced a dataset with 510 images. Finally, the
last dataset simulates a path in which the agent made an observation in the form of an
image every 5 m. The black and yellow circles shown are not to scale but provide a
general idea of the 5 m distance with respect to the overall floor plan. This 2-lap 5 m
dataset consists of 135 images.

It should be noted that the hallway used can be classified as office-like. The
distinction is made to express the volume of traffic, which causes inconsistent features
in the environment. While the building in which the tests were conducted was not a
densely populated area, there were instances of people walking through the hall that
provided positive obstacles for the feature algorithm to overcome. In computer vision, a
person or object that is apparent in one image and not in the next can have adverse
effects on image registration.

4.1.2 Agent. The agent, the Pioneer 2-AT shown in Fig. 4.3, was driven via remote
control. Contained in its setup is all of the hardware required for data collection. The
hardware used is listed in Fig. 4.2. (Fig. 4.3 does display some hardware not listed.
These devices were not used in this implementation.)

Figure 4.1: Environment Layout. Legend: 1260 image dataset (210 m); 510 image dataset
(140 m); 135 image dataset (5 m links).

Pioneer 2-AT Hardware

•  PixeLINK PL-A741 machine vision cameras (2)
•  MicroRobotics MIDG II consumer grade microelectromechanical systems (MEMS) IMU
•  Vision Computer
   -  2.0 GHz Intel Core2Duo processor
   -  4 GB memory
   -  Nvidia 9800GTX+ Graphics Processing Unit (GPU)
•  Internal Computer
   -  Records vehicle odometry

Figure 4.2: Pioneer 2-AT Hardware.


The internal computer on board the robot calculates odometry based on the
skid-wheel steering of indoor tires with dimensions: 7 in diameter, 2 in width. The average
vehicle speed was about 1 m/s. As the vehicle was driven by remote control, there were
occurrences of variance in speed. These were assumed to be negligible in mapping.




Figure 4.3: Pioneer 2-AT vehicle used for data collection. The circled
camera on the left of the camera bar was used to conduct data association.
The IMU is mounted at the center of the camera bar while the vision
computer and communications equipment are mounted towards the rear
of the chassis.


4.1.3 Image and Sensor Data Retrieval. The cameras shown in Fig. 4.3
are placed in stereo mode; however, this implementation only uses images from the
slave camera, as denoted by the circle in the figure. It has a 90° field of view, and
images are taken at a resolution of 1280 x 960 at a rate of 2 Hz. Taking into account the
average speed of 1 m/s, images were captured approximately every 0.5 m. The image
retrieval algorithm for this system runs a highly parallelized SURF
feature extraction program for FV-SIFT [76]. FV-SIFT is an internal navigation
system that tracks image features to update the pose given from inertial measurements
[76]. This section of the process is computed on the GPU, which simultaneously allows
for the recording of stereo images and IMU data. The vehicle's internal computer
records odometry used for pose estimation and links to the external computer via
Ethernet cable for time synchronization and communication with FV-SIFT.

For the purposes of this implementation, the images from the slave camera as
well as the orientation information from FV-SIFT are used. The images are saved as
portable graymap files (.pgm) in Netpbm format, while the other data collected or
calculated are written as logs to text files.

4.2 Implementation Details & Parameters

The images are analyzed for features using a compiled executable for SIFT
features [46] and Matlab® 2010a for HOG. All feature descriptors and parameter
information are stored as Matlab® .mat files. As this implementation does not include
a loop closure heuristic, the image data is manually split between database and query
by visually determining the identifiers of the first and second loop. Each query is
then processed by the Data Association algorithm, which is written in Matlab®. All processing
was completed on dual core processors with 4 GB of memory.

4.2.1  Memory  Obstacles.  As  mentioned  previously,  this  implementation  was 
run  on  a  dual-core  computer  with  limited  memory.  For  this  as  well  as  possible  future 
use  in  online  mobile  algorithms,  it  is  important  to  add  heuristics  that  constrain  the 
number  of  features  in  the  overall  database  or  those  sent  to  the  KLSH  algorithm  at 
one  time. 


4.2.1.1 Database Reduction Heuristic. The feature database as a
whole can have hundreds of thousands of features that must be sifted through in
order to find a match to a given query. By dividing the feature database into classes,
the number of features to be hashed for a particular query analysis is reduced. One
such class is based on heading. The features seen going in one direction will be different
from the features that are seen going the opposite direction in the same environment.
Therefore, a query can only be matched to features whose difference in orientation is
within tolerance. This implementation assigns the heading categories shown in Table
4.1 to each feature based on heading information gathered from the FV-SIFT output.
This results in each query being matched with features in 1 of 4 KLSH-driven hash
tables.
Table 4.1: Database Reduction Heuristic: Orientation Classification. This
is the orientation classification used to classify the features in each image.

Heading Range (degrees)    Category
0 - 51                     1
52 - 143                   2
144 - 225                  3
226 - 321                  4
322 - 359                  1
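
The heading classification in Table 4.1 can be illustrated with a short Python sketch. It is not the Matlab code used in this work. The function names, the flooring of fractional headings, and the (descriptor, heading) tuple layout are assumptions made for illustration only.

```python
import math
from collections import defaultdict


def heading_category(heading_deg):
    """Map a heading in degrees to the category of Table 4.1.

    Assumption: fractional headings are floored and headings outside
    0-359 are wrapped, neither of which is stated in the text.
    """
    h = math.floor(heading_deg) % 360
    if h <= 51 or h >= 322:
        return 1  # 0-51 and 322-359 share category 1
    if h <= 143:
        return 2
    if h <= 225:
        return 3
    return 4  # 226-321


def split_by_heading(features):
    """Group (descriptor, heading) pairs into one list per category.

    Each list would feed its own set of hash tables, so a query is only
    compared against features recorded with a similar heading.
    """
    tables = defaultdict(list)
    for descriptor, heading in features:
        tables[heading_category(heading)].append(descriptor)
    return dict(tables)


groups = split_by_heading([("a", 10), ("b", 100), ("c", 350)])
```

Here the descriptors "a" and "c" land in category 1 because headings near 0 and near 359 degrees describe almost the same direction of travel.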


4.2.1.2 Multiple KLSH runs per kernel. While trimming down the
database size with orientation heuristics helps, this implementation still ran into
memory issues. To address them, the KLSH hashing process was broken into
multiple iterations. Define $N_{F,\theta}$ as the number of database features corresponding to
the orientation of $q$. Dividing $N_{F,\theta}$ across $l$ iterations produces

$$N_{F_l} = \frac{N_{F,\theta}}{l} \tag{4.1}$$

features per iteration. This reduces the task of hashing $N_{F,\theta}$ points
to $b$ tables to $l$ iterations of hashing $N_{F_l}$ points to $b$ tables. This has a direct effect on the
kernel matrix size. Consider $N_{F,\theta} = 10\text{K}$, $l = 20$. By

$$N_{F_l} = \frac{10\text{K}}{20} = 500, \tag{4.2}$$

the kernel matrix decreases in size from one 10K x 10K matrix to 20 kernel matrices of size 500 x 500,
thereby lessening memory requirements to those of a standard computer when running
this implementation. Furthermore, the speed at which these table hashings are computed
is increased.
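
To make the memory saving concrete, the following Python sketch repeats the arithmetic of the example above. The 8 bytes per kernel entry (double precision) and the function names are assumptions for illustration; the text does not state the storage format used.

```python
def kernel_matrix_bytes(n_points, bytes_per_entry=8):
    """Memory for a dense n_points x n_points kernel matrix."""
    return n_points * n_points * bytes_per_entry


def chunked_kernel_plan(n_features, n_iterations, bytes_per_entry=8):
    """Compare one full kernel matrix against l smaller ones."""
    # Ceiling division so that every feature is assigned to an iteration.
    per_iteration = -(-n_features // n_iterations)
    per_iteration_bytes = kernel_matrix_bytes(per_iteration, bytes_per_entry)
    return {
        "features_per_iteration": per_iteration,
        "full_matrix_bytes": kernel_matrix_bytes(n_features, bytes_per_entry),
        "per_iteration_bytes": per_iteration_bytes,
        "all_iterations_bytes": n_iterations * per_iteration_bytes,
    }


# The example from the text: 10K features split over 20 iterations.
plan = chunked_kernel_plan(10_000, 20)
```

Under these assumptions the single 10K x 10K matrix needs 800 MB, while each 500 x 500 matrix needs 2 MB and only one has to be held in memory at a time.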


4.2.2  Kernel  Matrix  Generation.  [42]  reported  that  both  the  Gaussian 
radial  basis  function  (RBF)  and  Chi-Squared  kernels  were  used  in  testing  on  multiple 
datasets.  Therefore  this  implementation  tested  on  both  as  well. 

Gaussian RBF kernel

$$k(p, q) = e^{-\|p - q\|^2} \tag{4.3}$$

Chi-Squared kernel

$$k(p, q) = \sum_i \frac{(p_i - q_i)^2}{\frac{1}{2}(p_i + q_i)} \tag{4.4}$$


In each case, $p$ and $q$ are the $N \times d$ matrices of database and query features, in which
$N$ and $d$ are the number of features and number of dimensions, respectively. For a
given KLSH iteration, the inputs are the features from the query image, of size $N_q \times d$,
as well as the comparison database features, of size $N_p \times d$. Naturally, $N_p \gg N_q$. In order
to compute $k$, these matrices must be the same size. The solution to this is to build
the $q$ kernel input such that it matches the dimensions of $p$:

$$q = \begin{bmatrix} N_q \times d \text{ matrix of query features} \\ (N_p - N_q) \times d \text{ matrix of ones} \end{bmatrix} \tag{4.5}$$
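
The padding step can be sketched in Python as follows. This is an illustration only, not the Matlab code of this work: the function names are hypothetical, plain lists stand in for matrices, and the `gamma` bandwidth parameter is an assumption (with `gamma = 1` the function evaluates exp(-||p - q||^2)).

```python
import math


def pad_query(query, n_db_rows):
    """Append rows of ones so the query block has n_db_rows rows."""
    if len(query) > n_db_rows:
        raise ValueError("query has more rows than the database block")
    d = len(query[0])
    padding = [[1.0] * d for _ in range(n_db_rows - len(query))]
    return [list(row) for row in query] + padding


def gaussian_rbf(p, q, gamma=1.0):
    """Gaussian RBF kernel between two feature vectors."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(p, q))
    return math.exp(-gamma * squared_distance)


database = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0], [0.5, 0.5]]  # N_p = 4, d = 2
query = [[0.0, 0.0], [1.0, 0.0]]                             # N_q = 2, d = 2

padded = pad_query(query, len(database))
row_kernels = [gaussian_rbf(p, q) for p, q in zip(database, padded)]
```

After padding, both inputs have 4 rows, so the kernel can be evaluated row by row. The rows of ones are placeholders and carry no query information.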

4.3 Results

To determine the best fit for parameters and test overall performance of the
KLSH algorithm, the following tests were run:

1.  Parameter sweep: best choice for KLSH hash parameters n, b, and t
2.  Parameter sweep: best choice of RANSAC iterations
3.  Feature extraction comparison (SIFT vs HOG)
4.  Parameter sweep: best choice of k for NN search
5.  Data association performance: required trade-offs between memory use, accuracy, and speed

In using parameter sweeps to determine the best set, multiple tests were run,
varying the parameters such that an optimal solution could be calculated or extracted
from the results. The best choice for each sweep is then used in determining the overall
performance of the algorithm in Section 4.3.5. This performance, as well as the
underlying determination of a good parameter, is based on an accuracy of match metric.
Once KLSH and RANSAC have run and an answer is provided for a given query,
the accuracy of that match is calculated based on a truth table. During preprocessing,
the first and second loop images were aligned by time stamp. A truth table between
the two categories of time stamps was generated, then verified visually. Recall the
image capture rate of 2 Hz. For the large full loop dataset, the assumption was made
that if the vehicle moved at an average speed of 1 m/s then the average distance between
2 images is 0.5 m. The desired position error is ±2 m. This therefore equates to
requiring a matching image time stamp to be within ±2 seconds of the truth
to be considered a positive match.
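
The accuracy of match metric can be sketched in Python as below. The function names and the list of (matched, truth) time stamp pairs are hypothetical; only the 1 m/s speed, the 2 Hz capture rate, and the ±2 m tolerance come from the text.

```python
AVERAGE_SPEED_M_PER_S = 1.0
CAPTURE_RATE_HZ = 2.0

# Distance travelled between consecutive images: 1 m/s / 2 Hz = 0.5 m.
image_spacing_m = AVERAGE_SPEED_M_PER_S / CAPTURE_RATE_HZ


def time_tolerance_s(position_tolerance_m, speed=AVERAGE_SPEED_M_PER_S):
    """Convert a position tolerance into a time stamp tolerance."""
    return position_tolerance_m / speed


def is_positive_match(matched_ts, truth_ts, tolerance_s):
    """A match counts if its time stamp is within the tolerance of truth."""
    return abs(matched_ts - truth_ts) <= tolerance_s


def accuracy_of_match(results, tolerance_s):
    """Fraction of (matched, truth) time stamp pairs that are positive."""
    hits = sum(is_positive_match(m, t, tolerance_s) for m, t in results)
    return hits / len(results)


tolerance = time_tolerance_s(2.0)  # +/- 2 m at 1 m/s gives +/- 2 s
example = [(10.0, 10.0), (11.5, 10.0), (13.0, 10.0), (8.0, 10.0)]
score = accuracy_of_match(example, tolerance)
```

In this made-up example three of the four matches fall within ±2 s of the truth, so the accuracy of match is 0.75.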

4.3.1 KLSH parameters. Determining a good parameter set is key in deriving
the neighboring features in the KLSH output. Such a parameter set balances
accuracy of match against speed, memory usage, and complexity. To start, the
value of n, the number of features in the database to be sampled, was analyzed.
In contrast to [42], which left this parameter constant at 300 for all database
sizes, this implementation left the determination of the size of n up to the algorithm
heuristics mentioned in Section 4.2.1.1. In other words, n varies based on the number
of features, $N_{F,\theta}$, in the database. Since the KLSH algorithm is split into l iterations
of b hash tables, the size of n changes with every iteration. Furthermore, in an
actual SLAM implementation, the size of the entire database, N, will require some
constraints, because the overall size will continue to grow the longer the agent travels in the
environment.

To analyze the effect the parameters have on an individual match solution, this
implementation conducted a parameter sweep varying the number of hash
tables, b, the number of hash bits per table, t, as well as the size of the dataset, n. The
test was conducted as follows:
1.  Manually select a query and a matching database image.
2.  Insert these as well as the features of $N_I$ neighboring images into KLSH, $N_I \in \{7, 10, 20, 30, 40, 50\}$.
    (a)  50 features per image are used.
    (b)  The first iteration uses $N_I = 7$, producing 350 features.
    (c)  The process is repeated for a total of 6 iterations, each increasing the number of images added to the database, thereby varying n over $350 \le n \le 2500$ with $N_I$.
3.  Run KLSH 2500 times per iteration of $N_I$, varying b and t over $1 \le b \le 500$, $1 \le t \le 300$.
4.  Assign a 1 if the correct match was one of the 7 NNs and a 0 otherwise.
    (a)  A correct match is an image whose IID is within 2 s of the exact match.
5.  These steps were conducted using SIFT features and a KLSH Gaussian kernel retrieving 7 NNs.

Fig.  4.4  shows  the  KLSH  performance  as  a  function  of  the  parameters  n,  t,  and 
b.  Each  plot  represents  an  increasing  sample  size,  n,  while  the  rows  and  columns 
represent  the  increasing  number  of  hash  tables  and  hash  bits  per  table  respectively. 
For  a  given  b,t  a  white  color  assignment  (1)  means  that  there  was  a  correct  match  to 
the  query  within  the  top  7  nearest  neighbors  returned.  A  black  color  assignment  (0) 
is  indicative  of  a  false  match. 

These results show that when the dataset size is small, the probability of a match is large
regardless of the hash parameters b and t. As either b or t increases, KLSH performance
increases; however, increasing both parameters simultaneously degrades performance
severely. [42] found t = 30, b = 300 to be a successful parameter set
based on the low percentage of database features that had to be searched
for a query. This implementation used the plots in Fig. 4.4 to evaluate these parameters as
well. In 5 of the 6 iterations shown, t = 30, b = 300 produced a correct match. These
parameters are also memory efficient. As noted above, b
and t should be inversely related in size to produce sufficient accuracy. Since
t enters the memory requirement of $O(N(t^2 b))$ quadratically, the set
t = 30, b = 300 is much more efficient than a set with $b \ll t$. The plots also show that low
values of b and t perform well. For instance, b = t = 10 not only produces
good accuracy throughout all database sizes, but also minimizes the memory required.
The parameter set used for testing was in accordance with [42] at b = 300, t = 30
and $n = f(N_p)$. In conclusion, based on the sweep conducted, the optimal parameter set is
b = t = 10.

Figure 4.4: KLSH performance varying n, sample sizes, t, hash bits per
table and b, hash tables. Panels: n = 350 features (7 images), n = 500 features
(10 images), n = 1000 features (20 images), n = 1500 features (30 images),
n = 2000 features (40 images), n = 2500 features (50 images). The horizontal
axis of each panel is the number of hash bits per table, t. A white color
assignment means there was a correct KLSH query match among the 7 NNs
returned, while a black assignment means there was no match.
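
The memory argument can be checked numerically with the short Python sketch below. It treats the stated requirement, N times t squared times b, as a unit-less cost proxy; the constant factors hidden by the O-notation are unknown, so only the ratios are meaningful. The function name is hypothetical.

```python
def hash_memory_cost(n_features, t, b):
    """Unit-less proxy for the O(N * t^2 * b) memory requirement."""
    return n_features * t * t * b


n = 2500  # largest sample size used in the sweep

cost_reference = hash_memory_cost(n, t=30, b=300)  # parameters from [42]
cost_swapped = hash_memory_cost(n, t=300, b=30)    # a set with b much smaller than t
cost_small = hash_memory_cost(n, t=10, b=10)       # low values of both

swapped_ratio = cost_swapped / cost_reference
small_ratio = cost_reference / cost_small
```

Under this proxy, swapping the roles of t and b costs 10 times more than t = 30, b = 300, while b = t = 10 costs 270 times less.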
4.3.2 RANSAC Parameter, R. The number of iterations, R, in a typical
RANSAC algorithm is a function of the probability that there is a better consensus set
than the one discovered thus far. The greater the number of iterations, the higher
the probability that the best set has been reviewed. Recall the calculation for
the minimum number of iterations required in Equation (3.25). Instead of calculating
this parameter directly, a parameter sweep was conducted varying the number of
iterations and calculating the accuracy of match. Fig. 4.5 shows the data association
accuracy using RANSAC while varying the number of iterations between 10 and 500. For this
test, 100 SIFT features were used from the 1260 image dataset, and the 7 KLSH nearest
neighbors hashed with parameters b = 300, t = 30 from a Gaussian kernel were input
to RANSAC with the features from each query.


Figure  4.5:  RANSAC  iteration  parameter  sweep.  This  sweep  shows  the 
accuracy  of  match  in  varying  the  number  of  RANSAC  iterations.  This  test 
was  conducted  using  the  top  100  SIFT  features  and  7  KLSH  NNs  from  a 
dataset  of  a  combined  1260  query  and  database  images. 


As shown in Fig. 4.5, for any number of iterations above 225, the probability of
accuracy averages 0.95. This is a commonly used algorithm termination criterion
[71], [23]. Based on this result, the number of RANSAC iterations was set at R = 275.
This value sufficiently balances algorithm run time and accuracy.
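
For comparison with the sweep, the analytic minimum can be sketched in Python. Equation (3.25) is not reproduced in this chapter, so the sketch assumes it has the widely used form R = log(1 - p) / log(1 - w^s), where p is the desired confidence, w is the inlier ratio, and s is the number of samples drawn per iteration (4 for a homography). The function name and the example values of w are illustrative assumptions, not measurements from this work.

```python
import math


def ransac_min_iterations(confidence, inlier_ratio, sample_size):
    """Smallest R so that at least one all-inlier sample is drawn
    with probability `confidence`."""
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    if not 0.0 < inlier_ratio <= 1.0:
        raise ValueError("inlier_ratio must be in (0, 1]")
    p_all_inliers = inlier_ratio ** sample_size
    if p_all_inliers >= 1.0:
        return 1  # every sample is outlier free
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_all_inliers))


# Illustrative values only: half of the correspondences are inliers.
r_95 = ransac_min_iterations(confidence=0.95, inlier_ratio=0.5, sample_size=4)
r_99 = ransac_min_iterations(confidence=0.99, inlier_ratio=0.5, sample_size=4)
```

The required count rises quickly as the inlier ratio falls, which is one reason an empirical sweep is a reasonable alternative when the inlier ratio is not known in advance.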
4.3.3 SIFT vs HOG Features. The HOG descriptor vector does not perform
as well on larger images as it has previously on lower resolution images. The
accuracy of match for the HOG descriptor, while still suitable, does not compare with
the accuracy of SIFT. Fig. 4.6 shows the accuracy of match from each dataset
while varying the number of HOG (solid) and SIFT (dashed) features per image.


Figure 4.6: SIFT vs HOG. These plots show the accuracy of match while
varying the number of HOG and SIFT features used in KLSH. These
results were produced from a Gaussian kernel, using 7 NNs. Panels: (a) 135 image
dataset, (b) 510 image dataset, (c) 1260 image dataset.
Figure 4.7: Implementation Performance: Probability of Accurate Match
with  Varying  Number  of  HOG  Features.  This  plot  shows  that  the  accuracy 
of  match  increases  with  the  number  of  HOG  features  used  per  image.  This 
test  was  completed  using  HOG  features  with  a  Gaussian  kernel  on  the  135 
image  dataset. 

The results shown were run using a Gaussian kernel taking the top 7 KLSH nearest
neighbors hashed with parameters b = 300 and t = 30. In each dataset, the HOG
accuracy of match increases with the number of features. This can further be seen
in Fig. 4.7. Therefore, in order to obtain accuracy comparable to that of SIFT, more
features are required. Continuing to increase the number of features led to HOG probabilities of accuracy of
0.90+ with 250+ features. Requiring twice as many features as SIFT essentially
doubles the memory requirements to $O(2N_s(t^2 b))$, where $N_s$ is the number of SIFT
features.

4.3.4 Best Choice of NN. To find the best choice for the number of nearest
neighbors to retrieve from KLSH, a parameter sweep was conducted analyzing the
performance of 1 to 20 nearest neighbors. Fig. 4.8 shows the accuracy at each parameter
value for all datasets.

For all datasets, there is an average peak of accuracy around 5 nearest neighbors.
This shows that regardless of the size of the dataset, this is a good choice for k.
High accuracy was also achieved at k = 1 nearest neighbor. The difference in
accuracy between k = 1 and k = 5 can in most cases be considered negligible, depending on
the application. Final determination of this parameter essentially comes down to
the compromises that can be accepted between speed and memory usage vs accuracy.
This means that given an entire dataset, the search for an exact match to a query can
be reduced to searching 5 images rather than all. The resulting complexity therefore
is a trade-off of an increase from $O(N_p(t^2 b))$ for k = 1 to $O(5N_p(t^2 b))$ for k = 5.

4.3.5 Data Association Performance. Next, the parameters established in
the previous sections were used to conduct the data association on each dataset while
varying the number of features. Fig. 4.9 shows plots of the number of features vs accuracy
for both the Gaussian and Chi-Squared kernels for all datasets.

In all datasets, the Gaussian kernel surpassed the performance of the Chi-Squared
kernel. While the accuracy in all cases increases with the number of
features, KLSH via Gaussian kernel performs at 0.95 probability of accuracy with 60
features and above in larger datasets.

4.3.6 Flaws in Accuracy. The plots in Fig. 4.9 show that as the datasets and
number of features both increase, the accuracy does as well. In looking specifically
at each association made, it was noticed that, in general, it was the same positions
in the environment that were being misassociated. Therefore the degradation in
performance is more a matter of the environment than of the capability of the algorithm. There
are always going to be certain areas in environments in which features are hard
to pick up. The key is to optimize feature extraction methods to minimize these
occurrences.

4.4 Summary

This chapter has reviewed the test setup and results. The KLSH hash parameters
used produced results answering the data association with metric level precision.
The SIFT feature detector outperformed HOG in accuracy, requiring fewer features
per image to hash similar points. HOG needed roughly twice as many features, which doubles the
amount of memory required to produce hash tables relative to SIFT. The Gaussian
kernel was found to perform better than the Chi-Squared kernel throughout all
dataset sizes. Like the HOG features, however, the Chi-Squared kernel does improve
in accuracy as the number of features grows. This in turn increases the complexity
and memory requirements. Finally, this implementation found that the number of
nearest neighbors required to achieve optimum accuracy can be narrowed to between
k = 1 and 5. Although the accuracy is higher at k = 5, more memory storage is required,
while the opposite is true for k = 1, where speed and memory
usage are optimized with a slight trade-off in accuracy. Which
situation is preferable is problem specific.
Figure 4.8: Parameter Sweep: KLSH k. These plots show the accuracy
of match produced with a variation in the number of nearest neighbors
kept. This test was completed using different numbers of SIFT features,
according to the legend, with a Gaussian kernel. Panels: (a) 135 image dataset,
(b) 510 image dataset, (c) 1260 image dataset.

Figure 4.9: Implementation Performance: Gaussian vs Chi-Squared kernel
comparison. These plots show the accuracy of match while varying the
number of SIFT features used in KLSH for both the Gaussian (blue) and
Chi-Squared (red) kernels. Panel label recovered: (a) 135 image dataset.
V. Conclusion & Future Work


This research has presented an implementation of the hash table-based nearest neighbor
search method, KLSH, to conduct the data association. The method has been
shown to answer the data association of a pose with metric level accuracy within 2
m. This method can therefore be implemented as the data association solution in a
vision-SLAM algorithm. This section discusses the pros and cons of this method as
well as the areas found to require future work.

5.1 Conclusions

5.1.1 KLSH Performance. KLSH has been shown to be a strong nearest neighbor
algorithm, achieving high accuracy while narrowing down the search space needed to
answer the data association. The Chi-Squared kernel used in conjunction with SIFT
features in [42] did not perform as well in this implementation. The Gaussian kernel
performed better by far. Since images were taken at a rate of approximately 2 Hz
with an approximate velocity of 1 m/s, a correct data association provides metric
level precision down to 2 m. This is a remarkable capability that can be implemented
to produce accurate metric SLAM maps.

5.1.2 SIFT vs HOG. This research tests the robustness of the HOG feature detector
when used for image navigation purposes. In general, its use in facial and object
recognition has been found to produce better results than in this hash-based data
association implementation. The accuracy of match for a particular query was found
to be lower than that of SIFT but increased with the number of features per image.
The computational complexity and memory requirements for this increase in features
will, in most problems, not warrant its use. When online capability is the ultimate goal,
the trade-off between maximum speed and minimum memory usage must be optimized. The SIFT
feature detector produced matches with greater than 0.90 probability of accuracy on
larger datasets with as few as 20 features per image. This yields a memory usage of
$O(20 N_{F,\theta}(t^2 b))$.
5.1.3 Evaluation. Even setting aside the fact that this implementation was done in
Matlab®, the overall process is time consuming. By adding database size
heuristics and dividing hash table calculations into multiple iterations, the speed of
the system was greatly increased without loss in efficiency, thereby optimizing use in
image mapping. The goal of online capability may be achievable provided
the number of features used per image remains relatively low. System constraints
are needed, however, to determine an actual requirement. Finally, it was determined
that there is a trade-off in terms of optimization in the number of nearest neighbors
required. As the number of nearest neighbors increased past 5, the accuracy decreased.
Therefore, KLSH was found to require no more than 5 images to search in
matching a query. In the case of the 1260 image dataset with over 630 database
images, that is only about 1% of the entire database. It was also discovered, however, that
at k = 1, while the probability of accuracy was not as high, it was still good, above
0.90 in most cases. This trade-off in accuracy saves computation time and memory
by searching through just 1 image instead of all 5. The conditions warranting the
trade-off are problem dependent, but in most cases the difference in accuracy may be
negligible. The downfall of KLSH is that as the database size continues to grow,
speed will decrease and memory usage will increase. While it was shown that the
number of hash tables and bit length of each need not change with the increase in
database size, memory usage increases by $t^2 b$ with each added feature.

5.2  Future  Work 

This implementation unfortunately was not able to compare against the k-d tree method
using the same datasets. Therefore an accurate comparison of which method
is better could not be accomplished. This is the first area that needs to be addressed
in future work. This implementation also needs to be tested in a vision-SLAM
implementation and optimized for use online. While the dimensionality required to
accurately describe image features has always been much larger than that of other
sensor methods, this implementation performs well, maintaining accuracy across all
database sizes while searching a fraction of a percent of the data. The resulting
map produced in SLAM will determine whether or not the degradations in accuracy
found would impede the overall localization determination. Areas of concern for use
online lie in database size. Further heuristics are required to limit the database size
while still being able to accurately associate similar points. Furthermore, loop closure
techniques need to be implemented when running in conjunction with SLAM.

While this algorithm performs well at a low number of features, the system
could be further optimized with the use of a region or interest point detector that
uses general position estimation techniques to estimate a feature's position in the next
frame. Tracking fewer features from image to image in this way, as is done in SLAM, will
lower the overall number of features in the database. This will not speed up the algorithm
in terms of dimensionality complexity but will aid in lowering the number of entries in
the overall database. Another method to decrease the size of the database is to use
variations of features as in [18]. Similar to SLAM feature tracking techniques, when
a feature is tracked in an environment, variations of that feature in subsequent scenes
are combined and associated with a covariance. This ensures that changes
in feature appearance as it gets closer do not hinder tracking and data association.
This also minimizes the number of additional features that need to be saved in the
database when tracking features through the environment.
Appendix A. Supplementary Math


A.1 Hamming Distance Metric

Given a dataset P, let C be the largest coordinate across all dimensions and
all samples. This allows P to be embedded into the Hamming cube with d' = Cd.
This is done by representing each point p by a binary vector in the form of the unary
representation. The unary representation of a number $x_i$ is a vector of $x_i$ ones
followed by $C - x_i$ zeros.

If C = 10 and $x_i = 5$, then $\text{Unary}_C(x_i)$ = [1111100000].

The resulting binary representation of a point in P is

$$v(p) = \text{Unary}_C(x_1), \ldots, \text{Unary}_C(x_d). \tag{A.1}$$

Consider the 2D point p = [3 5]. The resulting unary representation based on
C = 5 is:

$$v(p) = [1110011111] \tag{A.2}$$

The Hamming distance, $d_H$, of points p, q in the set {1, ..., d'} is

$$D(p, q) = d_H(v(p), v(q)) \tag{A.3}$$

in which the function $d_H$ is the sum of the XOR of the input terms.
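
The embedding and distance above can be reproduced with a few lines of Python. This is a minimal sketch with hypothetical function names, assuming non-negative integer coordinates no larger than C.

```python
def unary(x, c):
    """Unary code of x: x ones followed by c - x zeros."""
    if not 0 <= x <= c:
        raise ValueError("coordinate must lie between 0 and C")
    return [1] * x + [0] * (c - x)


def embed(point, c):
    """Concatenate the unary codes of all coordinates (length C * d)."""
    bits = []
    for x in point:
        bits.extend(unary(x, c))
    return bits


def hamming(u, v):
    """Sum of the bitwise XOR of two equal-length binary vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(a ^ b for a, b in zip(u, v))


v_p = embed([3, 5], 5)  # the 2D example from the text
v_q = embed([1, 4], 5)  # a second, made-up point
distance = hamming(v_p, v_q)
```

For these two points the Hamming distance of the embeddings is 3, which equals |3 - 1| + |5 - 4|: the unary embedding turns the sum of absolute coordinate differences into a Hamming distance.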

A.2 Central Limit Theorem

Consider a sequence of i.i.d. random variables $X_1, X_2, \ldots$, each with a finite mean, $\mu$, and a
finite non-zero variance, $\sigma^2$. The sum of the first n is represented by

$$S_n = X_1 + X_2 + \cdots + X_n \tag{A.4}$$

A zero-mean, unit variance random variable, $Z_n$, can then be calculated as

$$Z_n = \frac{S_n - n\mu}{\sigma\sqrt{n}} \tag{A.5}$$

The central limit theorem states that as the number of terms, n, in the sum
becomes large, the cdf of the normalized $S_n$ approaches that of a Gaussian random
variable:

$$\lim_{n \to \infty} P[Z_n \le z] = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{z} e^{-x^2/2}\, dx \tag{A.6}$$

This Gaussian approximation holds true for any distribution as long as there is
a finite mean, finite non-zero variance and a properly normalized sum [45].

A.3 2D Homography Estimation


Figure A.1: Homography Example.


Consider the 2 images in Fig. A.1, P and P', taken of the same scene, whose only
differences may include rotation, translation and scale. Since the points in these two
images are in different planes or spaces, linear comparisons based on position cannot
be completed. A homography is a point to point mapping between two spaces. In
this example, a homography enables this pair of images to be compared to complete
tasks such as seeing how alike they are or, more commonly, image mosaicking. This
enables multiple images of the same scene to be "pieced together" in the event they
were cropped apart as in Fig. A.2, or paired together to produce wide angle panoramic
views. This type of image registration is known as image mosaicking.
Figure A.2: Mosaicking example. Original image (a). Various crops with rotation (b-e). Mosaicked image (f).

Fig. A.2 is an example of the image scenario discussed above. The two images are from the same scene; however, the second varies in translation and scale. The following derivation calculates the homography estimate for mapping between two images [23].

The mapping of the points $p$ in the first image to the corresponding points $p'$ in the second image defines the homography as

$$w p' = H p \tag{A.7}$$

where $H$ is a $3 \times 3$ transformation matrix. It is important to note that the scale between the images is arbitrary and not defined in the matrix, leaving 8 degrees of freedom. Each 2D point mapping supplies 2 constraints, so a minimum of 4 point correspondences is required to solve for $H$. The system can also be solved with more than 4 points; it is then overdetermined, and an approximate solution can be found by least squares.

Let $p = (x, y, 1)$ and $p' = (x', y', 1)$ be corresponding points in images $P$ and $P'$. Expanding the homography given in equation (A.7) yields

$$\begin{bmatrix} w x' \\ w y' \\ w \end{bmatrix} = \begin{bmatrix} h_{11} & h_{12} & h_{13} \\ h_{21} & h_{22} & h_{23} \\ h_{31} & h_{32} & h_{33} \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} \tag{A.8}$$

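As mentioned above, the scale parameter cancels when the equations are written out: substituting $w = h_{31}x + h_{32}y + h_{33}$ from the third row into the first two rows leaves

$$\begin{aligned} h_{11}x + h_{12}y + h_{13} - h_{31}xx' - h_{32}yx' - h_{33}x' &= 0 \\ h_{21}x + h_{22}y + h_{23} - h_{31}xy' - h_{32}yy' - h_{33}y' &= 0 \end{aligned} \tag{A.9}$$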

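Expressing this as the linear system $Ah = 0$ yields the following $2n \times 9$ matrix $A$, in which $n$ is the number of points used to calculate the homography in each image:

$$A = \begin{bmatrix}
x_1 & y_1 & 1 & 0 & 0 & 0 & -x_1 x'_1 & -y_1 x'_1 & -x'_1 \\
0 & 0 & 0 & x_1 & y_1 & 1 & -x_1 y'_1 & -y_1 y'_1 & -y'_1 \\
x_2 & y_2 & 1 & 0 & 0 & 0 & -x_2 x'_2 & -y_2 x'_2 & -x'_2 \\
0 & 0 & 0 & x_2 & y_2 & 1 & -x_2 y'_2 & -y_2 y'_2 & -y'_2 \\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\
x_n & y_n & 1 & 0 & 0 & 0 & -x_n x'_n & -y_n x'_n & -x'_n \\
0 & 0 & 0 & x_n & y_n & 1 & -x_n y'_n & -y_n y'_n & -y'_n
\end{bmatrix} \tag{A.10}$$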

$h$ is then a $9 \times 1$ vector comprising the terms of the homography matrix $H$. Solving for $H$ is now accomplished using singular value decomposition of $A$, yielding

$$A = U S V' \tag{A.11}$$

in which the columns of $U$ and $V$ are the eigenvectors of $AA'$ and $A'A$ respectively, and $S$ contains the singular values, which are the square roots of the corresponding eigenvalues. The $h$ vector is equal to the column of $V$ corresponding to the smallest singular value of $A$. This value is 0 if the system is exactly determined, that is, has only 4 points, and is the value closest to 0 in an overdetermined system.

The $h$ vector forms the homography matrix, $H$, as follows:

$$H = \begin{bmatrix} h_1 & h_2 & h_3 \\ h_4 & h_5 & h_6 \\ h_7 & h_8 & h_9 \end{bmatrix}. \tag{A.12}$$
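The estimation procedure can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in this work: it assumes NumPy is available, the function name `estimate_homography` and the test points are hypothetical, and it omits refinements such as coordinate normalization or outlier rejection.

```python
import numpy as np


def estimate_homography(src, dst):
    """Estimate H such that w * [x', y', 1]^T = H @ [x, y, 1]^T.

    src, dst: (n, 2) arrays of corresponding points, with n >= 4.
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    if src.shape != dst.shape or src.ndim != 2 or src.shape[1] != 2:
        raise ValueError("src and dst must both have shape (n, 2)")
    if src.shape[0] < 4:
        raise ValueError("at least 4 point correspondences are required")

    rows = []
    for (x, y), (xp, yp) in zip(src, dst):
        # Each correspondence contributes the two equations of (A.9).
        rows.append([x, y, 1, 0, 0, 0, -x * xp, -y * xp, -xp])
        rows.append([0, 0, 0, x, y, 1, -x * yp, -y * yp, -yp])
    A = np.array(rows)

    # np.linalg.svd returns V transposed, with singular values in descending
    # order, so the last row of vt belongs to the smallest singular value.
    _, _, vt = np.linalg.svd(A)
    H = vt[-1].reshape(3, 3)

    # Remove the arbitrary scale (assumes the last entry of h is nonzero).
    return H / H[2, 2]


# Synthetic check: project points through a known H, then recover it.
H_true = np.array([[1.2, 0.1, 5.0],
                   [-0.2, 0.9, 3.0],
                   [0.001, 0.002, 1.0]])
src = np.array([[0, 0], [100, 0], [100, 80], [0, 80], [50, 40]], dtype=float)
proj = np.c_[src, np.ones(len(src))] @ H_true.T
dst = proj[:, :2] / proj[:, 2:3]  # divide out the scale w

H_est = estimate_homography(src, dst)
print(np.round(H_est, 4))
```

Dividing by the last entry is one common way to fix the free scale; with noisy correspondences the same code returns the least-squares estimate described above.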